Chapter 1
Introduction 


Graphs naturally represent information ranging from links between
webpages, to users’ movie preferences, to friendships and
communications in social networks, to connections between voxels
in our brains (Figure 1.1). Informally, a graph is a mathematical
model for pairwise relations between objects. The objects
are often referred to as nodes and the relations between them
are represented by links or edges, which define influence and
dependencies between the objects.

Real-world graphs often span hundreds of millions or even billions
of nodes and interactions between them. Within this deluge
of  interconnected  data,  how  can  we  extract  useful  knowledge  in  a  scalable  way  and  without 
flooding  ourselves  with  unnecessary  information?  How  can  we  find  the  most  important 
structures  and  effectively  summarize  the  graphs?  How  can  we  efficiently  visualize  them? 
How can we start from little prior information (e.g., a few vandals and good contributors in
Wikipedia) and broaden our knowledge to all the entities using network effects? How can
we make sense of and explore multiple phenomena, represented as graphs, at the same
time?  How  can  we  detect  anomalies  that  indicate  critical  events,  such  as  a  cyber-attack  or 
disease  formation  in  the  human  brain?  How  can  we  summarize  temporal  graph  patterns, 
such  as  the  appearance  and  disappearance  of  an  online  community? 

This  thesis  focuses  on  fast  and  principled  methods  for  exploratory  analysis  of  one  or 
more  networks  in  order  to  gain  insights  into  the  above-mentioned  problems.  The  main 
directions  of  our  work  are:  (a)  summarization,  which  provides  a  compact  and  interpretable 
representation of one or more graphs, and (b) similarity, which enables the discovery of
clusters  of  nodes  or  graphs  with  related  properties.  We  provide  theoretical  underpinnings 
and scalable algorithmic approaches that exploit the sparsity in the data, and we show
how to use them in large-scale, real-world applications, including anomaly detection in
static and dynamic graphs (e.g., email communications or computer network monitoring),
re-identification across networks, cross-network analytics, clustering, classification, and
visualization.

Figure 1.1: Brain graph. The nodes correspond to cortical regions and the links represent
water flow between them.


1.1  Overview  and  Contributions 

This  thesis  is  organized  into  two  main  parts:  (i)  single-graph  exploration,  and  (ii)  multiple- 
graph  exploration.  We  summarize  the  main  problems  of  each  part  in  the  form  of  questions 
in  Table  1.1. 


Table 1.1: Thesis Overview.

Part                  Research Problem                                            Chapter
--------------------  ----------------------------------------------------------  -------
I: Single Graph       Graph Summarization: How can we succinctly describe a       3
                      large-scale graph?
                      Inference: What can we learn about all the nodes given      4, 5
                      prior information for a subset of them?
II: Multiple Graphs   Temporal Summarization: How can we succinctly describe a    6
                      set of large-scale, time-evolving graphs?
                      Graph Similarity: What is the similarity between two        7
                      graphs? Which nodes and edges are responsible for their
                      difference?
                      Graph Alignment: How can we efficiently align two           8
                      bipartite or unipartite graphs?

1.1.1  Part  I:  Single-Graph  Exploration 

At  a  macroscopic  level,  how  can  we  extract  easy-to-understand  building  blocks  from  a 
massive  graph  and  make  sense  of  its  underlying  phenomena?  At  a  microscopic  level,  after 
obtaining  some  knowledge  about  the  graph  structures,  how  can  we  further  explore  the 
nodes  and  find  important  node  patterns  (regular  or  anomalous)?  Our  work  proposes 
scalable  ways  for  summarizing  large-scale  information  by  leveraging  global  and  local 
graph  properties.  Summarization  of  massive  data  enables  its  efficient  visualization,  guides 
focus  on  its  important  aspects,  and  thus  is  key  for  understanding  the  data. 

Graph Summarization

How can we succinctly describe a large-scale graph?

A direct way of making sense of a graph is to model it at a macroscopic level and
summarize it (Chapter 3). Our method, VoG [KKVF14, KKVF11], aims at succinctly
describing a million-node graph with just a few, possibly-overlapping structures, which
can be easily understood. While visualization of large graphs often fails due to memory
requirements, or results in a clutter of nodes and edges without any visible information,
our method enables effective visualization by focusing on a handful of important,
simple structures.

We  formalize  the  graph  summarization  problem 
as  an  information-theoretic  optimization  problem,  where  the  goal  is  to  find  the  hidden 
local  structures  that  collectively  minimize  the  global  description  length  of  the  graph.  In 
addition  to  leveraging  the  Minimum  Description  Length  principle  to  find  the  best  graph 
summary,  another  core  idea  is  the  use  of  a  predefined  vocabulary  of  structures  that  are 
semantically meaningful and ubiquitous in real networks: cliques and near-cliques, stars,
chains, and (near-)bipartite cores. VoG is near-linear in the number of edges of the graph and
finds  interesting  patterns,  such  as  edit  wars  between  admins  and  vandals  in  Wikipedia 
collaboration  networks. 
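
In generic two-part MDL form, the objective described above can be sketched as follows. This is only an illustration: the symbols M (a candidate summary built from the vocabulary), \mathcal{M} (the set of candidate summaries), and L(·) (a description length in bits) are notation assumed here, and the precise encoding used by VoG is the subject of Chapter 3.

```latex
M^{*} \;=\; \arg\min_{M \in \mathcal{M}} \; \bigl[\, L(M) + L(G \mid M) \,\bigr]
```

Here L(M) is the cost of describing the chosen structures and L(G | M) is the cost of describing whatever part of the graph G those structures do not explain, so a summary is only worth adding structures to as long as they save more bits than they cost.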

Contributions: 

•  Methodology:  VoG  provides  a  parameter-free  way  of  summarizing  an  unweighted, 
undirected  graph.  It  enables  visualization  and  guides  the  attention  to  important 
structures  within  graphs  with  billions  of  nodes  and  edges. 

•  Effectiveness  on  real  data:  Analysis  of  a  variety  of  real-world  graphs,  including  social 
networks,  web  graphs,  and  co-edit  Wikipedia  graphs,  shows  that  they  do  have 
structure,  which  allows  for  better  compression  and  more  compact  representation. 

Impact: 

•  VoG  was  selected  as  one  of  the  best  papers  of  SDM’14. 

• It is taught in a graduate class at Saarland University (Topics in Algorithmic Data
Analysis).

•  VoG  contributes  to  the  detection  of  insider  threat  in  DARPA’s  $35  million  project 
ADAMS. 


Figure 1.2: Summarization by VoG of a controversial Wikipedia graph. (a) Stars (their
hubs are in red and correspond to editors, heavy users, bots). (b) Bipartite core, an
‘edit war’: warring factions (one in the top-left red circle), reverting their edits.


Node Similarity (proximity) as Further Exploration

What can we learn about all the nodes given prior information for a subset of them?

After  gaining  knowledge  about  the  important  graph  structures  and  their  underlying 
behaviors  through  graph  summarization  (e.g.,  by  using  VoG),  how  can  we  extend  our 
knowledge  and  find  similar  nodes  within  a  graph,  at  a  microscopic  level?  For  example, 
suppose  that  we  know  a  class-label  (say,  the  type  of  contributor  in  Wikipedia,  such  as 
vandals/admins)  for  some  of  the  nodes  in  the  graph.  Can  we  infer  who  else  is  a  vandal  in 
the  network?  This  is  the  problem  we  address  in  Chapter  4.  The  semi-supervised  setting, 
where  some  prior  information  is  available,  appears  in  numerous  domains,  including  law 
enforcement,  fraud  detection  and  cyber  security.  Among  the  most  successful  methods 
that  attempt  to  solve  the  problem  are  the  ones  that  perform  inference  by  exploiting  the 
global  network  structure  and  local  homophily  effects  (“birds  of  a  feather  flock  together”). 
Starting from belief propagation, a powerful technique that handles both homophily and
heterophily in networks, we have mathematically derived an accurate and faster (2x)
linear approximation, FaBP, with convergence guarantees [KKK+ ]. We note that the
original belief propagation algorithm, which is iterative and based on message passing, is
not guaranteed to converge in loopy graphs, which are the most prevalent type of real-world
graphs. The derived formula revealed the equivalence of FaBP to random walks with
restarts  and  semi-supervised  learning,  and  led  to  the  unification  of  the  three  guilt-by- 
association methods. In Chapter 5 we also present LinBP [GGKF15], which extends FaBP
to the multi-class setting (e.g., classification of webpages as liberal, conservative, or
centrist).
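
As an illustration of the guilt-by-association idea, the sketch below implements random walks with restarts, one of the three methods named above, by plain power iteration. It is not FaBP itself; the function name, the restart probability of 0.15, the iteration count, and the toy graph are assumptions made for this example.

```python
def random_walk_with_restart(adj, seed, restart=0.15, iters=100):
    """Return proximity scores of all nodes with respect to `seed`.

    adj: dict mapping each node to a non-empty list of its neighbors.
    restart: probability of jumping back to the seed (assumed value).
    """
    nodes = list(adj)
    scores = {v: 0.0 for v in nodes}
    scores[seed] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in nodes}
        for v in nodes:
            # Each node passes its score evenly to its neighbors.
            share = (1.0 - restart) * scores[v] / len(adj[v])
            for u in adj[v]:
                nxt[u] += share
        # The walker restarts at the seed with probability `restart`.
        nxt[seed] += restart
        scores = nxt
    return scores


# Two triangles (0-1-2 and 3-4-5) joined by the bridge edge 2-3.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3],
         3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
scores = random_walk_with_restart(graph, seed=0)
```

With node 0 as the labeled seed, nodes in the seed's own triangle receive higher scores than nodes across the bridge, which is the homophily assumption in its simplest form.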

Contributions: 

• Methodology 1: FaBP provides a closed formula for belief propagation and convergence
guarantees. The derived linear system enables solving the inference problem
efficiently with standard linear system solvers.

•  Methodology  2:  FaBP  and  LinBP  handle  heterophily  and  homophily  in  two-  or 
multi-class  settings. 

•  Effectiveness  on  real  data:  We  applied  FaBP  to  perform  node  classification  on  a  web 
snapshot  collected  by  Yahoo  with  over  6  billion  links  between  webpages.  We  ran  the 
experiment  on  Yahoo’s  M45  Hadoop  cluster  with  480  machines. 

Impact: 

•  FaBP  work  is  taught  at  Rutgers  University  (16:198:672)  and  CMU  (47-953). 

1.1.2 Part II: Multiple-Graph Exploration

In  many  applications,  it  is  necessary  or  at  least  beneficial  to  explore  multiple  graphs  at 
the  same  time.  These  graphs  can  be  temporal  instances  on  the  same  set  of  objects  (time- 
evolving  graphs),  or  disparate  networks  coming  from  different  sources.  At  a  macroscopic 
level,  how  can  we  extract  easy-to-understand  building  blocks  from  a  series  of  massive 
graphs  and  summarize  the  dynamics  of  their  underlying  phenomena  (e.g.,  communication 
patterns  in  a  large  phone-call  network)?  How  can  we  find  anomalies  in  a  time-evolving 
corporate-email  correspondence  network  and  predict  the  fall  of  a  company?  Are  there 
differences  in  the  brain  wiring  of  more  creative  and  less  creative  people?  How  do 
different  types  of  communication  (e.g.,  messages  vs.  wall  posts  in  Facebook)  and  their 
corresponding  behavioral  patterns  compare?  Our  work  proposes  scalable  ways:  (a)  for 
summarizing  large-scale  temporal  information  by  extending  our  ideas  on  single-graph 
summarization  (Chapter  3),  and  (b)  for  comparing  and  aligning  graphs,  which  are  often 
the  underlying  problems  in  applications  with  multiple  graphs. 

Temporal  Graph  Summarization 

How  can  we  succinctly  describe  a  set  of  large-scale,  time-evolving  graphs? 

Just  like  in  the  case  of  a  single  graph,  a  natural  way  of  making  sense  of  a  series  of  graphs 
is to model them at a macroscopic level and summarize them (Chapter 6). Our method,
TimeCrunch [ KZ 1 ], manages to succinctly describe a large, dynamic graph with just
a few phrases. Even visualizing a single large graph often fails due to memory requirements or
results in a clutter of nodes and edges without any useful information. Making sense of a
large,  time-evolving  graph  introduces  even  more  challenges,  so  detecting  simple  temporal 
structures  is  crucial  for  visualization  and  understanding. 

Extending our work on single-graph summarization presented in Chapter 3, we formalize
the temporal graph summarization problem as an information-theoretic optimization
problem,  where  the  goal  is  to  identify  the  temporal  behaviors  of  local  static  structures  that 
collectively  minimize  the  global  description  length  of  the  dynamic  graph.  We  formulate 
a  lexicon  that  describes  various  types  of  temporal  behavior  (e.g.,  flickering,  periodic, 
one-shot)  exhibited  by  the  structures  that  we  introduced  for  summarization  of  static 
graphs  in  Chapter  3  (e.g.,  stars,  cliques,  bipartite  cores,  chains).  TimeCrunch  is  an 
effective  and  scalable  method  which  finds  interesting  patterns  in  a  dynamic  graph,  such 
as  ‘flickering  star’  communications  in  the  Enron  email  correspondence  graph,  and  ‘ranged 
cliques’  of  co-authors  in  DBLP  who  jointly  published  every  year  from  2007  to  2012  (mostly 
members  of  the  NIH  NCBI  community). 

Contributions:

• Methodology: TimeCrunch provides an interpretable way of summarizing unweighted,
undirected dynamic graphs by using a suitable lexicon. It detects temporally
coherent subgraphs which may not appear at every timestep, and enables the
visualization of dynamic graphs with hundreds of millions of nodes and interactions
between them.

• Effectiveness on real data: Analysis of a variety of real-world dynamic graphs
(including email exchange, instant messaging and phone-call networks, computer
network attacks and co-authorship graphs) shows that they are structured, which
allows for a more compact representation.


Graph  Similarity 

What  is  the  similarity  between  two  graphs? 

Which  nodes  and  edges  are  responsible  for  their  difference? 

Graph similarity (Chapter 7), the problem of assessing the similarity between two
node-aligned graphs, has numerous high-impact applications, such as real-time anomaly
detection in e-commerce and computer networks, which can prevent millions of dollars in
damage. Although it is a long-studied problem, most methods do not give intuitive results
and ignore the ‘inherent’ importance of the graph edges: e.g., an edge connecting two
tightly-connected components is often assumed to be as important as an edge connecting two
nodes in a clique. Our work [ , ] redefines the space with new desired
properties for graph similarity measures, and addresses scalability challenges. Based
on the new requirements, we devised a massive-graph similarity algorithm, DeltaCon,
which measures the differences in the k-step away neighborhoods in a principled way
that uses a variant of Belief Propagation [ ] introduced in Chapter 4 for inference
in a single graph. DeltaCon takes into account both local and global dissimilarities, but
in a weighted manner: local dissimilarities (smaller k) are weighted higher than global
ones (bigger k). It also detects the nodes and edges that are mainly responsible for the
difference between the input graphs [KSV+15].

We have applied our similarity work to (a) detect anomalous events in time-evolving
collaboration and social networks [ , ], and (b) cluster collaboration,
technological networks, and brain graphs with up to 90 million edges. DeltaCon led to a
fascinating  finding  in  the  space  of  neuroscience.  We  have  found  that  the  brain  wiring  of 
creative  and  non-creative  people  is  significantly  different:  creative  subjects  have  more 
and  heavier  cross-hemisphere  connections  than  non-creative  subjects  (Figure  1.3). 

Figure 1.3: (a) Hierarchical clustering on brain graphs based on their edge connectivities:
creative subjects’ brains are wired differently than the rest based on DeltaCon. Subjects
with high (up-arrows) creativity index (CCI) are mostly in the green cluster. (b) Low-CCI
brain and (c) high-CCI brain: the low-CCI brain has fewer and less heavy cross-hemisphere
connections than the high-CCI brain.


Contributions: 

• Methodology: We designed DeltaCon, which assesses the similarity between aligned
networks in a principled, interpretable and scalable way. NetSimile provides a
framework for computing the similarity between unaligned networks independent
of their sizes.

• Effectiveness on real data: We applied our work to numerous real-world applications,
which led to several interesting discoveries including the difference in brain
connectivity between creative and non-creative people.

Impact: 

•  DeltaCon  is  taught  in  a  graduate  class  at  Rutgers  University  (16:198:672). 

•  It  contributes  to  DARPA’s  $35  million  project  Anomaly  Detection  at  Multiple  Scales 
(ADAMS)  to  detect  insider  threats  in  the  government  and  military. 

•  We  are  working  on  DeltaCon’s  scientific  discovery  in  brain  graphs  with  experts  in 
neuroscience  (at  Johns  Hopkins  University  and  University  of  New  Mexico). 

Graph  Alignment 

How  can  we  efficiently  node-align  two  bipartite  or  unipartite  graphs? 

The  graph  similarity  work  introduced  in  Chapter  7  assumes  that  the  correspondence  of 
nodes  across  graphs  is  known,  but  this  does  not  always  hold  true.  Social  network  analysis, 
bioinformatics  and  pattern  recognition  are  just  a  few  domains  with  applications  that  aim 
at  finding  node  correspondence.  In  Chapter  8  we  handle  exactly  this  problem. 

Until now, research has focused on the alignment of unipartite networks. We focused
on bipartite graphs, formulated the alignment of such graphs as an optimization
problem with practical constraints (e.g., sparsity, 1-to-many mapping), and developed a
fast algorithm, BiG-Align [ 13], to solve it. The key to solving the problem of global
alignment efficiently is a series of optimizations that we devised, including the aggregation
of nodes with local structural equivalence into supernodes. This leads to huge space and
time savings: with careful handling, a node correspondence submatrix of several millions
of small entries can be reduced to just one. Based on our formulation for bipartite graphs,
we also introduced Uni-Align, an alternative way of effectively and efficiently aligning
unipartite graphs.

Contributions: 

•  Methodology:  We  introduced  an  alternative  way  of  aligning  unipartite  graphs  which 
outperforms  the  state-of-the-art  approaches. 

• Effectiveness: BiG-Align is up to 100x faster and 2-4x more accurate than competitor
alignment methods.

Impact: 

• This work resulted in seven patents, one of which obtained a “rate-1” score (the
highest score, corresponding to ‘extremely high potential business value for IBM’).


1.2  Overall  Impact 

The  core  of  the  thesis  focuses  on  developing  fast  and  principled  algorithms  for  exploratory 
analysis of one or more large-scale networks in order to gain insights into them and
understand  them.  Our  contributions  are  in  the  areas  of  single-  and  multiple-graph 
exploration,  within  which  we  focus  on  summarization  and  similarity  at  the  node  and 
graph  level.  The  work  has  broad  impact  on  a  variety  of  applications:  anomaly  detection 
in static and dynamic graphs, clustering and classification, cross-network analytics,
re-identification across networks, and visualization in various types of networks, including
social  networks  and  brain  graphs.  Finally,  our  work  has  been  used  in  the  following 
settings: 

• Taught in graduate classes: Our methods on node inference (FaBP, LinBP), graph
similarity (DeltaCon), and graph summarization (VoG) are being taught in graduate
courses, e.g., at Rutgers University (16:198:672), Tepper School at CMU (47-953),
Saarland University (Topics in Algorithmic Data Analysis), and Virginia Tech (CS 6604).

•  Used  in  real  world: 

■ Our summarization work (VoG [KKVF1 ]) and similarity algorithm (DeltaCon
[KVF1 ]) are used in DARPA’s Anomaly Detection at Multiple Scales project
(ADAMS)  to  detect  insider  threats  and  exfiltration  in  the  government  and  the 
military. 

■ Seven patents have been filed at IBM on our graph alignment method. One
of them received rate-1 (the top rating, corresponding to “extremely high
potential business value for IBM”).

•  Awards:  VoG  [KKVF14]  was  selected  as  one  of  the  best  papers  of  SDM’14. 

Next  we  present  background  in  graph  mining  and  introduce  useful  graph-theoretical 
notions  and  definitions. 

Chapter 2
Background 


In  this  chapter  we  introduce  the  main  notions  and  definitions  in  graph  theory  and  mining 
that  are  useful  for  understanding  the  methods  and  algorithms  described  in  this  dissertation. 
At  the  end  of  this  chapter  we  give  a  table  with  the  common  symbols  and  their  descriptions. 


2.1  Graphs 

We  start  with  the  definition  of  a  graph,  followed  by  the  different  types  of  graphs  (e.g., 
bipartite,  directed,  weighted),  and  the  special  cases  of  graphs  or  motifs  (e.g.,  star,  clique). 

Graph: A representation of a set of objects connected by links (Figure 2.1).
Mathematically, it is an ordered pair G = (V, E), where V is the set of objects (called
nodes or vertices) and E is the set of links between some of the objects (also called
edges).

Nodes or Vertices: A finite set V of objects in a graph. For example, in a social network,
the nodes can be people, while in a brain network they correspond to voxels or cortical
regions. The total number of nodes in a graph is often denoted as |V| or n.

Figure 2.1: An undirected, unweighted graph.

Edges or Links: A finite set E of lines between objects in a graph. The edges represent
relationships between the objects, e.g., friendship between people in social networks or
water flow between voxels in our brains. The total number of edges in a graph is often
denoted as |E| or m.

Neighbors: Two vertices v and u connected by an edge are called neighbors. Vertex
u is called a neighbor or adjacent vertex of v. In a graph G, the neighborhood of a
vertex  v  is  the  induced  subgraph  of  G  consisting  of  all  vertices  adjacent  to  v  and  all  edges 
connecting  two  such  vertices. 


Bipartite  Graph:  A  graph  that  does  not  contain  any 
odd-length  cycles.  Alternatively,  a  bipartite  graph  is  a 
graph  whose  vertices  can  be  divided  into  two  disjoint 
sets  U  and  V  such  that  every  edge  connects  a  vertex  in 
U to one in V (Figure 2.2). A graph whose vertex set
cannot be divided into two disjoint sets with that property
is called unipartite. A tree is a special case of a bipartite
graph.
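
The two equivalent definitions above suggest a simple test: try to 2-color the graph with a breadth-first search, and report failure when an edge joins two vertices of the same color (which happens exactly when an odd-length cycle exists). The sketch below assumes an undirected graph stored as a dict of neighbor lists; the function name and the example graphs are illustrative.

```python
from collections import deque


def bipartition(adj):
    """Return the two vertex sets (U, V) if the graph is bipartite, else None."""
    color = {}
    for start in adj:              # handle every connected component
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if u not in color:
                    color[u] = 1 - color[v]   # neighbors get the other color
                    queue.append(u)
                elif color[u] == color[v]:
                    return None               # odd-length cycle found
    U = {v for v, c in color.items() if c == 0}
    V = {v for v, c in color.items() if c == 1}
    return U, V


star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}          # a tree, hence bipartite
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}           # a 3-cycle, not bipartite
```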


Figure  2.2:  A  bipartite  graph. 


Directed Graph: A graph whose edges have a direction associated with them (Figure 2.3).
A directed edge is represented by an ordered pair of vertices (u, v), and is illustrated by
an arrow starting from u and ending at v. A directed graph is also called a digraph or
directed network. The directionality captures non-reciprocal relationships. Examples of
directed networks are the who-follows-whom Twitter network (an arrow starts from the
follower and ends at the followee) or a phonecall network (with arrows from the caller to
the callee). A graph whose edges are unordered pairs of vertices is called undirected.

Figure 2.3: A directed graph.


Weighted Graph: A graph whose edges have a weight associated with them (Figure 2.4).
If the weights of all edges are equal to 1, then the graph is called unweighted. The
weights can be positive or negative, integer or real-valued. For example, in a phonecall
network the weights may represent the number of calls between two people.

Labeled  Graph:  A  graph  whose  nodes  or  edges  have 
a  label  associated  with  them  (Figure  2.6).  An  example 
of  a  vertex-labeled  graph  is  a  social  network  where  the 
names  of  the  people  are  known. 


Figure  2.4:  A  weighted  graph.  The 
width  of  each  edge  represents  its 
weight. 


Egonet:  The  egonet  of  node  v  is  the  induced  subgraph  of  G  which  contains  v,  its  neighbors, 
and  all  the  edges  between  them.  Alternatively,  it  is  the  1-step  neighborhood  of  node  v. 
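
As a minimal sketch of the egonet definition, assuming a simple undirected graph stored as a dict of neighbor lists (the function name and the toy graph are illustrative):

```python
def egonet(adj, v):
    """Return the egonet of v as (nodes, edges) of the induced subgraph."""
    nodes = {v} | set(adj[v])
    # Keep every edge whose two endpoints both lie in the egonet.
    edges = {frozenset((a, b)) for a in nodes for b in adj[a] if b in nodes}
    return nodes, edges


# Triangle 0-1-2 with a tail 2-3-4.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
ego_nodes, ego_edges = egonet(graph, 3)
```

For node 3 the egonet contains nodes {2, 3, 4} and the two edges 2-3 and 3-4; the edges 0-2 and 1-2 are excluded because nodes 0 and 1 are not neighbors of node 3.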

Figure 2.5: Special cases of graphs: (a) clique, (b) bipartite core, (c) star, (d) chain.

Simple  Graph:  An  undirected,  unweighted  graph  containing  no  loops  (edges  from  a 
vertex  to  itself)  or  multiple  edges. 

Complete  Graph:  An  undirected  graph  in  which  every  pair  of  distinct  vertices  is  connected 
by  a  unique  edge. 

Clique:  A  subgraph  where  every  two  distinct  vertices  are  adjacent. 

Bipartite Core or Complete Bipartite Graph: A special case of bipartite graph where
every vertex of the first set U is connected to every vertex of the second set V. It is
denoted as K_{s,t}, where s and t are the number of vertices in U and V, respectively.

Star: A complete bipartite graph K_{1,t}, for any t. The vertex in set U is called the central
node or hub, while the vertices in V are often called peripheral nodes or spokes.

Chain:  A  graph  that  can  be  drawn  so  that  all  of  its  vertices  and  edges  lie  on  a  single 
straight  line. 

Triangle:  A  3-node  complete  graph. 


2.2  Graph  Properties 

Degree:  The  degree  of  a  vertex  v  (denoted  as  d(v))  in  a  graph  G  is  the  number  of 
edges  incident  to  the  vertex,  i.e.,  the  number  of  its  neighbors.  In  a  directed  graph,  the 
in-degree  of  a  vertex  is  the  number  of  incoming  edges,  and  its  out-degree  is  the  number 
of outgoing edges. Often, the degrees of the vertices in a graph are represented compactly
as a diagonal matrix D, where d_{ii} = d(i). The degree distribution is the probability
distribution of the node degrees in graph G.
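
The following sketch computes these quantities from a list of edges. The function names and the example star graph are assumptions made for illustration; the edge list is assumed to contain no duplicates or self-loops.

```python
from collections import Counter


def undirected_degrees(edges):
    """Degree d(v) of every vertex that appears in an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


def directed_degrees(edges):
    """In-degree and out-degree dicts, treating each pair (u, v) as u -> v."""
    in_deg, out_deg = Counter(), Counter()
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
    return dict(in_deg), dict(out_deg)


def degree_distribution(deg):
    """Fraction of vertices having each degree value."""
    counts = Counter(deg.values())
    return {k: c / len(deg) for k, c in counts.items()}


star_edges = [(0, 1), (0, 2), (0, 3)]       # hub 0 with three spokes
deg = undirected_degrees(star_edges)        # {0: 3, 1: 1, 2: 1, 3: 1}
dist = degree_distribution(deg)             # {3: 0.25, 1: 0.75}
```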

PageRank: The PageRank of a node v is a score that captures its importance relative
to the other nodes. The score depends only on the graph structure. PageRank is the
algorithm used by Google Search to rank webpages in the search results [BP98].
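
A minimal power-iteration sketch of PageRank is given below. The damping factor of 0.85, the fixed number of iterations, the even redistribution of score from nodes without outgoing links, and the toy web graph are all assumptions of this example rather than details taken from the text.

```python
def pagerank(out_links, damping=0.85, iters=100):
    """Return a PageRank score per node.

    out_links: dict mapping every node to the list of nodes it links to;
    every link target must also appear as a key.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            # A node without outgoing links spreads its score to all nodes.
            targets = out_links[v] or nodes
            share = damping * rank[v] / len(targets)
            for u in targets:
                nxt[u] += share
        rank = nxt
    return rank


web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(web)
```

In this toy graph page "c" is linked to by three pages and ends up with the highest score, followed by "a", which receives all of the score that "c" passes on.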

Geodesic Distance: The geodesic distance between two vertices v and u is the length of
the  shortest  path  between  them.  It  is  also  called  hop  count  or  distance. 

Node  eccentricity  or  radius:  The  eccentricity  or  radius  of  node  v  is  the  greatest  geodesic 
distance  between  v  and  any  other  vertex  in  the  graph.  Intuitively,  eccentricity  captures 
how  far  a  node  is  from  the  furthest  away  vertex  in  the  graph. 

Graph  Diameter:  The  diameter  of  a  graph  is  the  maximum  eccentricity  of  any  node 
in  the  graph.  The  smallest  eccentricity  over  all  the  vertices  in  the  graph  is  called  graph 
radius.  In  Figure  2.5(d),  the  diameter  of  the  chain  is  3,  and  its  radius  is  2. 
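
The chain example can be checked with a short breadth-first search. The sketch assumes a connected, unweighted, undirected graph stored as a dict of neighbor lists; the helper names are illustrative.

```python
from collections import deque


def bfs_distances(adj, source):
    """Geodesic (hop-count) distance from `source` to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


def eccentricity(adj, v):
    """Greatest geodesic distance between v and any other vertex."""
    return max(bfs_distances(adj, v).values())


chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}    # chain of 4 nodes
ecc = {v: eccentricity(chain, v) for v in chain}  # {0: 3, 1: 2, 2: 2, 3: 3}
diameter = max(ecc.values())                      # 3
radius = min(ecc.values())                        # 2
```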

Connected  component:  In  an  undirected  graph,  a  connected  component  is  a  subgraph  in 
which  any  vertex  is  reachable  from  any  other  vertex  (i.e.,  any  two  vertices  are  connected 
to  each  other  by  paths),  and  which  is  connected  to  no  additional  vertices  in  the  graph.  A 
vertex  without  neighbors  is  itself  a  connected  component.  Intuitively,  in  co-authorship 
networks,  a  connected  component  corresponds  to  researchers  who  publish  together,  while 
different  components  may  represent  groups  of  researchers  in  different  areas  who  have 
never  published  papers  together. 
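
A sketch of how connected components can be found with a graph traversal, using a small hypothetical co-authorship network (the author names and the function name are made up for illustration):

```python
def connected_components(adj):
    """Return the connected components of an undirected graph as vertex sets."""
    seen = set()
    components = []
    for start in adj:
        if start in seen:
            continue
        component = {start}
        seen.add(start)
        stack = [start]
        while stack:                      # depth-first traversal
            v = stack.pop()
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    component.add(u)
                    stack.append(u)
        components.append(component)
    return components


coauthors = {
    "ann": ["bob"], "bob": ["ann", "cat"], "cat": ["bob"],
    "dan": ["eve"], "eve": ["dan"],
    "fay": [],                            # no co-authors: its own component
}
components = connected_components(coauthors)
```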

Participating  Triangles:  The  number  of  distinct  triangles  in  which  a  node  participates. 
Triangles  have  been  used  for  spam  detection,  link  prediction  and  recommendation  in 
social  and  collaboration  networks,  and  other  real-world  applications. 
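
For small graphs, the number of participating triangles of each node can be counted directly from the definition by checking which pairs of its neighbors are themselves adjacent. The sketch assumes a simple undirected graph; the function name and toy graph are illustrative, and this quadratic-per-node approach is not meant for massive graphs.

```python
from itertools import combinations


def participating_triangles(adj):
    """Number of distinct triangles each node participates in."""
    neighbors = {v: set(us) for v, us in adj.items()}
    return {
        v: sum(1 for a, b in combinations(neighbors[v], 2) if b in neighbors[a])
        for v in neighbors
    }


# Triangle 0-1-2 with a pendant node 3 attached to node 2.
graph = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
triangles = participating_triangles(graph)    # {0: 1, 1: 1, 2: 1, 3: 0}
```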

Eigenvectors and Eigenvalues: The eigenvalues (eigenvectors) of a graph G are defined
to be the eigenvalues (eigenvectors) of its corresponding adjacency matrix, A. Formally,
a number λ is an eigenvalue of a graph G with adjacency matrix A if there is a non-zero
vector x such that Ax = λx. The vector x is the eigenvector corresponding to the
eigenvalue λ.

The  eigenvalues  characterize  the  graph’s  topology  and  connectedness  (e.g.,  bipartite, 
complete  graph),  and  are  often  used  to  count  various  subgraph  structures,  such  as 
spanning  trees.  The  eigenvalues  of  undirected  graphs  (with  symmetric  adjacency  matrices) 
are real. The (first) principal eigenvector captures the centrality of the graph’s vertices;
this is related to Google’s PageRank algorithm. The eigenvector associated with the second
smallest eigenvalue of the graph’s Laplacian matrix (Section 2.3) is used for graph
partitioning via spectral clustering.
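
To make the link between the principal eigenvector and centrality concrete, the sketch below approximates the largest eigenvalue of an adjacency matrix and its eigenvector by power iteration. It is an illustrative sketch that assumes a connected, non-bipartite, undirected graph (so that the iteration converges); the function name and iteration count are arbitrary choices.

```python
def principal_eigenpair(A, iters=500):
    """Approximate the largest eigenvalue of A and its eigenvector.

    The returned vector is scaled so that its largest entry equals 1.
    """
    n = len(A)
    x = [1.0] * n
    lam = 0.0
    for _ in range(iters):
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        lam = max(abs(value) for value in y)
        x = [value / lam for value in y]
    return lam, x


# Triangle 0-1-2 with a pendant node 3 attached to node 2.
A = [[0, 1, 1, 0],
     [1, 0, 1, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
lam, x = principal_eigenpair(A)
```

Node 2, which has the most connections, receives the largest entry of the eigenvector, while the pendant node 3 receives the smallest.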


2.3 Graph-theoretic Data Structures

The  data  structure  used  for  a  graph  depends  on  its  properties  (e.g.,  sparse,  dense,  small, 
large)  and  the  algorithm  applied  to  it.  Matrices  have  big  memory  requirements,  and  thus 
are  preferred  for  small,  dense  graphs.  On  the  other  hand,  lists  are  better  for  large,  sparse 
graphs,  such  as  social,  collaboration,  and  other  real-world  networks. 

Adjacency matrix: The adjacency matrix of a graph G is an n x n matrix A, whose
element a_{ij} is non-zero if vertex i is connected to vertex j, and 0 otherwise. In other
words, it represents which vertices of the graph are adjacent to which other vertices. For
a graph without loops, the diagonal elements a_{ii} are 0. The adjacency matrix of an
undirected graph is symmetric. The elements of the adjacency matrix of a weighted graph
are equal to the weights of the corresponding edges.

Incidence  list:  An  array  of  pairs  of  adjacent  vertices.  This  is  a  common  representation  for 
sparse,  real-world  networks,  because  of  its  efficiency  in  terms  of  memory  and  computation. 

Sparse  matrix:  A  matrix  in  which  most  of  the  elements  are  zero.  A  matrix  where  most 
of  the  entries  are  non-zero  is  called  dense.  Most  of  the  real-world  networks  (e.g.,  social, 
collaboration  and  phonecall  networks)  are  sparse. 

Degree matrix: An $n \times n$ diagonal matrix $\mathbf{D}$ that contains the degree of each node. Its
ith diagonal element $d_{ii}$ represents the degree of node i, d(i). The adjacency and degree matrices
of a simple, unweighted and undirected chain graph are given in Figure 2.7.


Figure 2.6: Adjacency matrix of a simple, unweighted and undirected graph.

Figure 2.7: A chain of 4 nodes, and its adjacency and degree matrices.


Laplacian matrix: An $n \times n$ matrix defined as $\mathbf{L} = \mathbf{D} - \mathbf{A}$, where $\mathbf{A}$ is the adjacency
matrix of the graph, and $\mathbf{D}$ is the degree matrix. The multiplicity of 0 as an eigenvalue of
the Laplacian matrix of the graph is equal to the number of its connected components.
The Laplacian matrix arises in the analysis of random walks, electrical networks on graphs,
spectral clustering, and other graph applications.
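The definitions of this section can be sketched on a 4-node chain. The snippet below starts from an array of adjacent vertex pairs and builds the adjacency, degree and Laplacian matrices as plain Python lists; the 0-based node numbering and the variable names are assumptions made for illustration.

```python
# 4-node chain 0 - 1 - 2 - 3, stored as an array of adjacent vertex pairs.
edges = [(0, 1), (1, 2), (2, 3)]
n = 4

# Adjacency matrix: symmetric because the graph is undirected.
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u][v] = 1
    A[v][u] = 1

# Degree matrix: node degrees on the diagonal, zeros elsewhere.
degree = [sum(row) for row in A]
D = [[degree[i] if i == j else 0 for j in range(n)] for i in range(n)]

# Laplacian matrix L = D - A.
L = [[D[i][j] - A[i][j] for j in range(n)] for i in range(n)]

# Every row of L sums to zero, so the all-ones vector is an eigenvector
# of L with eigenvalue 0.
row_sums = [sum(row) for row in L]
```

The edge list stores 3 pairs, whereas the matrix stores 16 cells of which only 6 are non-zero, which illustrates why lists are preferred for sparse graphs.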
2.4 Common Symbols


We  give  the  most  common  symbols  and  their  short  descriptions  in  Table  2.1.  Additional 
symbols  necessary  to  explain  the  proposed  methods  and  algorithms  are  provided  in  the 
corresponding  chapters. 

Table 2.1: Common Symbols and Definitions. Bold capital letters for matrices; lowercase bold letters
for vectors; plain font for scalars.

Symbol                   Description
$G$, $G_x$               graph, xth graph
$\mathcal{V}$            set of nodes
$n = |\mathcal{V}|$      number of nodes
$\mathcal{E}$            set of edges
$m = |\mathcal{E}|$      number of edges
$d(v)$                   degree of node v
$Pr(v)$                  PageRank of node v
$r$                      graph radius
$\mathbf{I}$             $n \times n$ identity matrix
$\mathbf{A}$             $n \times n$ adjacency matrix with elements $a_{ij} \in \mathbb{R}$
$\mathbf{D}$             $n \times n$ diagonal degree matrix, $d_{ii} = \sum_j a_{ij}$
$\mathbf{L}$             $= \mathbf{D} - \mathbf{A}$, Laplacian matrix
$\mathbf{L}_{norm}$      $= \mathbf{D}^{-1/2} \mathbf{L} \mathbf{D}^{-1/2}$, normalized Laplacian matrix
Part I

Single-Graph Exploration

Exploring a Single Graph: Overview


Graphs  naturally  represent  a  host  of  processes,  such  as  friendships  between  people  in  social 
networks,  collaborations  between  people  at  work,  or  even  water  flow  between  neurons  in  our 
brains.  Understanding  the  structures  and  patterns  in  a  single  graph  contributes  to  the  sense-making 
of  the  corresponding  natural  processes,  and  leads  to  ‘interesting’,  data-driven  questions.  In  this 
part  we  examine  two  ways  of  exploring  and  understanding  a  single,  large-scale  graph: 

•  Scalable and interpretable summarization of the graph in terms of important graph structures,
which enables efficient visualization (Chapter 3);

•  Fast, approximate inference, which can be used to classify nodes in a network given little
prior information (Chapters 4 and 5).

For  each  thrust,  we  give  observations,  and  models  for  real-world  graphs  followed  by  efficient 
algorithms  to  explore  the  different  aspects  (important  structures,  inferred  labels,  and  anomalies) 
of  a  single  graph,  and  gain  a  better  understanding  of  the  processes  it  captures. 
Chapter based on work that appeared at SDM 2014 [KKVF1 ], and Stat Anal Data Min [KKVF1 ].


Chapter  3 

Graph  Summarization 


One natural way to understand a graph and its underlying processes is to visualize and interact
with it. However, for large datasets with several millions or billions of nodes and edges, such
as the Facebook social network, even loading them using appropriate visualization software
requires a significant amount of time. If the memory requirements are met, visualizing the graph is
possible, but the result is a 'hairball' without obvious patterns: often the number of nodes is larger
than the number of pixels on a screen, while, at the same time, people have limited capacity for
processing information. How can we summarize efficiently, and in simple terms, which parts of
the graph stand out? What can we say about its structure? Its edge distribution will likely follow a
power law [ ], but apart from that, is it random? The focus of this chapter is finding short
summaries for large graphs, in order to gain a better understanding of their characteristics.

Why  not  apply  one  of  the  many  community  detection,  clustering  or  graph-cut  algorithms  that 
abound in the literature [ ]ZB+04, TDM08, PSS+10, KK99, CH94], and summarize the graph
in  terms  of  its  communities?  The  answer  is  that  these  algorithms  do  not  quite  serve  our  goal. 
Typically,  they  detect  numerous  communities  without  explicit  ordering,  so  a  principled  selection 
procedure  of  the  most  “important”  subgraphs  is  still  needed.  In  addition  to  that,  these  methods 
merely  return  the  discovered  communities,  without  characterizing  them  (e.g.,  clique,  star),  and 
thus,  do  not  help  the  user  to  gain  further  insights  into  the  properties  of  the  graph. 

We  propose  VoG,  an  efficient  and  effective  method  for  summarizing  and  understanding  large 
real-world  graphs.  In  particular,  we  aim  at  understanding  graphs  beyond  the  so-called  caveman 
networks  that  only  consist  of  well-defined,  tightly-knit  clusters,  which  are  known  as  cliques  and 
near-cliques  in  graph  terms. 

The  first  insight  is  to  best  describe  the  structures  in  a  graph  using  an  enriched  set  of  “vocabulary” 
terms:  cliques  and  near-cliques  (which  are  typically  considered  by  community  detection  methods), 
and  also  stars,  chains  and  (near)  bipartite  cores.  The  reasons  we  chose  these  “vocabulary”  terms 
are:  (a)  (near-)  cliques  are  included,  and  so  our  method  works  fine  on  caveman1  graphs,  and 

1  A  caveman  graph  arises  by  modifying  a  set  of  fully  connected  clusters  (caves)  by  removing  one 
edge  from  each  cluster  and  using  it  to  connect  to  a  neighboring  one  such  that  the  clusters  form  a  single 
loop  [Wat99] .  Intuitively,  a  caveman  graph  has  a  block-diagonal  matrix,  with  a  few  edge  additions  and 
(a) Original Wiki Liancourt-Rocks graph plotted using the 'spring embedded' layout [KK89].
No structure stands out.
(b) VoG: 8 out of the 10 most informative structures are stars (their centers in red: Wikipedia
editors, heavy contributors, etc.).
(c) VoG: The most informative bipartite graph, an 'edit war': warring factions (one of them in the
top-left red circle) changing each other's edits.
(d) VoG: The second most informative bipartite graph, another 'edit war', this one between vandals
(bottom-left circle of red points) and responsible editors (in white).

Figure 3.1: VoG: summarization and understanding of those structures of the Wikipedia
Liancourt-Rocks graph that are most important from an information-theoretic point of view.
Nodes stand for Wikipedia contributors and edges link users who edited the same part of the article.


(b) stars [LKF14], chains [TPSF01] and bipartite cores [KKR+99, PSS+10] appear very often, and
have semantic meaning (e.g., factions, bots) in the tens of real networks we have seen in practice
(e.g., the IMDB movie-actor graph, co-authorship networks, Netflix movie recommendations, US
Patent dataset, phonecall networks).

The second insight is to formalize our goal using the Minimum Description Length (MDL)
principle [Ris78] as a lossless compression problem. That is, by MDL we define the best summary
of a graph as the set of subgraphs that describes the graph most succinctly, i.e., compresses it
best, and, thus, may help a human understand the main graph characteristics in a simple,
non-redundant manner. A big advantage is that our approach is parameter-free, as at any stage
MDL identifies the best choice: the one by which we save the most bits.

Informally,  we  tackle  the  following  problem: 

Problem  Definition  1.  [Graph  Summarization  -  Informal] 

•  Given:  a  graph 

•  Find: a set of possibly overlapping subgraphs to most succinctly describe the given
graph, i.e., explain as many of its edges in as simple terms as possible, in a scalable way,
ideally linear in the number of edges.

Our  contributions  can  be  summarized  as: 

1.  Problem Formulation: We show how to formalize the intuitive concept of graph understanding
using principled, information-theoretic arguments.


deletions. 
2.  Effective and Scalable Algorithm: We design VoG, which is near-linear in the number of
edges. Our code for VoG is open-sourced at danaikoutra/SRC/vog.tar.

3.  Experiments on Real Graphs: We empirically evaluate VoG on several real, public graphs
with up to millions of edges. VoG spots interesting patterns like 'edit
wars' in the Wikipedia graphs (Figure 3.1).

The  roadmap  for  this  chapter  is  as  follows.  First,  Section  3.1  gives  the  overview  and  motivation 
of  our  approach.  Next,  in  Section  3.2  and  Section  3.3  we  respectively  present  the  problem 
formulation  and  describe  our  method  in  detail.  We  empirically  evaluate  VoG  in  Section  3.4  using 
qualitative  and  quantitative  experiments  on  a  variety  of  real  graphs.  We  discuss  its  implications  and 
limitations  in  Section  3.5,  and  cover  related  work  in  Section  3.6.  We  summarize  our  contributions 
and  findings  in  Section  3.7. 


3.1  Proposed  Method:  Overview  and  Motivation 

Before we give our two main contributions in the next sections (the problem formulation and the
search algorithm), we first provide the high-level outline of VoG, which stands for
Vocabulary-based summarization of Graphs:

(a)  We  use  MDL  to  formulate  a  quality  function:  a  collection  M  of  structures  (e.g.,  a  star  here, 
cliques  there,  etc.)  is  as  good  as  its  description  length  L(G,  M).  Hence,  any  subgraph  or  set 
of  subgraphs  has  a  quality  score. 

(b)  We  give  an  efficient  algorithm  for  characterizing  candidate  subgraphs.  In  fact,  we  allow 
any  subgraph  discovery  heuristic  to  be  used  for  this,  as  we  define  our  framework  in  general 
terms  and  use  MDL  to  identify  the  structure  type  of  the  candidates. 

(c)  Given a candidate set $\mathcal{C}$ of promising subgraphs, we show how to mine informative
summaries, removing redundancy by minimizing the compression cost.

VoG results in a list $M$ of possibly overlapping subgraphs, sorted in order of importance
(compression gain). Together these subgraphs succinctly describe the main connectivity of the graph.

The  motivation  behind  VoG  is  that  the  visualization  of  large  graphs  often  results  in  a  clutter 
of  nodes  and  edges,  and  hinders  interactive  exploration  and  discoveries.  On  the  other  hand,  a 
handful  of  simple,  ‘important’  structures  can  be  visualized  more  easily,  and  may  help  the  user 
understand  the  underlying  characteristics  of  the  graph.  Next  we  give  an  illustrating  example 
of  VoG,  where  the  most  ‘important’  vocabulary  subgraphs  that  constitute  a  Wikipedia  article’s 
(graph)  summary  are  semantically  interesting. 

Illustrating Example: In Figure 3.1 we give the results of VoG on a Wikipedia graph based
on the article about Liancourt-Rocks; the nodes are editors, and editors share an edge
if they edited the same part of the article. Figure 3.1(a) shows the graph using the
spring-embedded model [KK89]. No clear pattern emerges, and thus a human would have a hard time
understanding this graph. Contrast that with the results of VoG. Figures 3.1(b)-3.1(d) depict the
same graph, where we highlight the most important structures (i.e., structures that save the most
bits) discovered by VoG. The discovered structures correspond to behavioral patterns:

•  Stars → admins (+ vandals): In Figure 3.1(b), with red color, we show the centers of the
most important "stars": further inspection shows that these centers typically correspond to
administrators who revert vandalisms and make corrections.

•  Bipartite cores → edit wars: Figures 3.1(c) and 3.1(d) give the two most important
near-bipartite-cores. Manual inspection shows that these correspond to edit wars: two groups of
editors reverting each other's changes. For clarity, we denote the members of one group by
red nodes (left), and highlight the edges to the other group in pale yellow.

3.2  Problem  Formulation 

In  this  section  we  describe  the  first  contribution,  the  MDL  formulation  of  graph  summarization.  To 
enhance  readability,  we  list  the  most  frequently  used  symbols  in  Table  3.1. 

In general, the Minimum Description Length principle (MDL) [Ris83] is a practical version
of Kolmogorov Complexity [ V93], which embraces the slogan Induction by Compression. For
MDL, this can be roughly described as follows. Given a set of models $\mathcal{M}$, the best model
$M \in \mathcal{M}$ minimizes

\[ L(M) + L(\mathcal{D} \mid M), \]

where

•  $L(M)$ is the length, in bits, of the description of $M$, and

•  $L(\mathcal{D} \mid M)$ is the length, in bits, of the description of the data when encoded using the
information in $M$.

This is called two-part or crude MDL, as opposed to refined MDL, where model and data are
encoded together [ ]. We use two-part MDL because we are specifically interested in the
model: those graph connectivity structures that together best describe the graph. Further, although
refined MDL has stronger theoretical foundations, it cannot be computed except for some special
cases.

Without loss of generality, we here consider undirected graphs $G(\mathcal{V}, \mathcal{E})$ of $n = |\mathcal{V}|$ nodes
and $m = |\mathcal{E}|$ edges, with no self-loops. Our theory can be straightforwardly generalized to
directed graphs, and similarly to weighted graphs, provided one has an expectation of, or is willing
to make an assumption on, the distribution of the edge weights.

To use MDL for graph summarization, we need to define what our models $\mathcal{M}$ are, how a model
$M \in \mathcal{M}$ describes data, and how we encode this in bits. We do this next. It is important to note
that to ensure fair comparison, MDL requires descriptions to be lossless, and that in MDL we are
only concerned with the optimal description lengths, not actual instantiated code words, and
hence do not have to round up to the nearest integer.

Table 3.1: VoG: Description of the major symbols for static graph summarization.

Symbols                          Description
$G(\mathcal{V}, \mathcal{E})$    graph
$\mathbf{A}$                     adjacency matrix of G
$\mathcal{V}$, $n = |\mathcal{V}|$   node-set and number of nodes of G, respectively
$\mathcal{E}$, $m = |\mathcal{E}|$   edge-set and number of edges of G, respectively

$fc$, $nc$                       full clique and near clique, respectively
$fb$, $nb$                       full bipartite core and near bipartite core, respectively
$st$                             star graph
$ch$                             chain graph

$\Omega$                         vocabulary of structure types, e.g., $\Omega \subseteq \{fc, nc, fr, nr, fb, nb, ch, st\}$
$\mathcal{C}_x$                  set of all candidate structures of type $x \in \Omega$
$\mathcal{C}$                    set of all candidate structures, $\mathcal{C} = \cup_x \mathcal{C}_x$
$M$                              a model for G, essentially a list of node sets with associated structure types
$s, t \in M$                     structures in $M$
$area(s)$                        edges of G (= cells of $\mathbf{A}$) described by s
$|S|$, $|s|$                     cardinality of set S and number of nodes in s, respectively
$\|s\|$, $\|s\|'$                number of existing, resp. non-existing edges within the area of $\mathbf{A}$ that s describes

$\mathbf{M}$                     approximation of adjacency matrix $\mathbf{A}$ deduced by $M$
$\mathbf{E}$                     error matrix, $\mathbf{E} = \mathbf{M} \oplus \mathbf{A}$
$\oplus$                         exclusive OR

$L(G, M)$                        number of bits to describe model $M$, and G using $M$
$L(M)$                           number of bits to describe model $M$
$L(s)$                           number of bits to describe structure s

3.2.1  MDL  for  Graph  Summarization 

As models $M$, we consider ordered lists of graph structures. We write $\Omega$ for the set of graph
structure types that are allowed in $M$, i.e., that we are allowed to describe (parts of) the input
graph with. We will colloquially refer to $\Omega$ as our vocabulary. Although in principle any graph
structure type can be a part of the vocabulary, we here choose the 6 most common structures in
real-world graphs [KKR+99, PSS+10,  ] that are well-known and understood by the graph
mining community: full and near cliques ($fc$, $nc$), full and near bipartite cores ($fb$, $nb$), stars ($st$),
and chains ($ch$). Compactly, we have $\Omega = \{fc, nc, fb, nb, ch, st\}$. We will formally introduce these
types after formalizing our goal.
Figure 3.2: Illustration of our main idea on a toy adjacency matrix: VoG identifies overlapping
sets of nodes that form vocabulary subgraphs (cliques, stars, chains, etc.). VoG allows for the soft
clustering of nodes, as in clique A and near-clique B. Stars look like inverted L shapes (e.g., star C).
Chains look like lines parallel to the main diagonal (e.g., chain D).

Each structure $s \in M$ identifies a patch of the adjacency matrix $\mathbf{A}$ and describes how it is
connected (Figure 3.2). We refer to this patch, or more formally the edges $(i, j) \in \mathbf{A}$ that structure
s describes, as $area(s, M, \mathbf{A})$, where we omit $M$ and $\mathbf{A}$ whenever clear from context.

We allow overlap between structures²: nodes may be part of more than one structure. We
allow, for example, cliques to overlap. Edges, however, are described on a first-come-first-serve
basis: the first structure $s \in M$ to describe an edge $(i, j)$ determines the value in A. We do not
impose constraints on the amount of overlap; MDL will decide for us whether adding a structure
to the model is too costly with respect to the number of edges it helps to explain.

Let $\mathcal{C}_x$ be the set of all possible subgraphs of up to n nodes of type $x \in \Omega$, and $\mathcal{C}$ the union of
all of those sets, $\mathcal{C} = \cup_x \mathcal{C}_x$. For example, $\mathcal{C}_{fc}$ is the set of all possible full cliques. Our model family
$\mathcal{M}$ then consists of all possible permutations of all possible subsets of $\mathcal{C}$; recall that the models
$M$ are ordered lists of graph structures. By MDL, we are after the $M \in \mathcal{M}$ that best balances the
complexity of encoding both $\mathbf{A}$ and $M$.

Our general approach for transmitting the adjacency matrix is as follows. First, we transmit
the model $M$. Then, given $M$, we can build the approximation $\mathbf{M}$ of the adjacency matrix, as
defined by the structures in $M$; we simply iteratively consider each structure $s \in M$, and fill out
the connectivity of $area(s)$ in $\mathbf{M}$ accordingly. As $M$ is a summary, it is unlikely that $\mathbf{M} = \mathbf{A}$. Still, in
order to fairly compare between models, MDL requires an encoding to be lossless. Hence, besides
$M$, we also need to transmit the error matrix $\mathbf{E}$, which encodes the error with respect to $\mathbf{A}$. We
obtain $\mathbf{E}$ by taking the exclusive OR between $\mathbf{M}$ and $\mathbf{A}$, i.e., $\mathbf{E} = \mathbf{M} \oplus \mathbf{A}$. Once the recipient knows
$M$ and $\mathbf{E}$, the full adjacency matrix $\mathbf{A}$ can be reconstructed without loss.
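The lossless round trip described above can be sketched in a few lines of Python. The 5-node toy graph, the one-structure model (a full clique on nodes 0 to 3) and all names below are hypothetical and chosen only to illustrate the exclusive OR between the approximation and the adjacency matrix.

```python
n = 5
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]  # hypothetical toy graph

# Adjacency matrix of the input graph.
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u][v] = A[v][u] = 1

# Approximation induced by a one-structure model: a full clique on {0, 1, 2, 3}.
clique_nodes = [0, 1, 2, 3]
M_approx = [[0] * n for _ in range(n)]
for i in clique_nodes:
    for j in clique_nodes:
        if i != j:
            M_approx[i][j] = 1

# Error matrix: cells where the approximation and the graph disagree.
E = [[M_approx[i][j] ^ A[i][j] for j in range(n)] for i in range(n)]

# The receiver knows the approximation and E, and recovers A exactly.
A_reconstructed = [[M_approx[i][j] ^ E[i][j] for j in range(n)] for i in range(n)]

num_error_cells = sum(sum(row) for row in E)
```

Here the clique wrongly asserts the edges (0, 3) and (1, 3) and the model misses the edge (3, 4), so the symmetric error matrix has 6 non-zero cells.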

With this in mind, we have as our main score

\[ L(G, M) = L(M) + L(\mathbf{E}), \]

where $L(M)$ and $L(\mathbf{E})$ are the numbers of bits that describe the structures, and the error matrix $\mathbf{E}$,
respectively. We note that $L(\mathbf{E})$ maps to $L(\mathcal{D} \mid M)$, introduced in Section 3.2. That is, it corresponds
to the length, in bits, of the description of the data when encoded using the information in $M$.
The problem we tackle in this work is formally defined as follows.

2  This is a common assumption in mixed-membership stochastic blockmodels.


Problem Definition 2. [Minimum Graph Description Problem] Given a graph G with adjacency
matrix $\mathbf{A}$, and the graph structure vocabulary $\Omega$, by the MDL principle we are after the smallest
model $M$ for which the total encoded length is minimal, that is

\[ \min L(G, M) = \min \{ L(M) + L(\mathbf{E}) \}, \]

where $\mathbf{E} = \mathbf{M} \oplus \mathbf{A}$ is the error matrix, and $\mathbf{M}$ is an approximation of $\mathbf{A}$ deduced by $M$.

Next,  we  formalize  the  encoding  of  the  model  and  the  error  matrix. 

3.2.2  Encoding  the  Model 

For the encoded length of a model $M \in \mathcal{M}$, we have

\[
L(M) = \underbrace{L_{\mathbb{N}}(|M| + 1) + \log \binom{|M| + |\Omega| - 1}{|\Omega| - 1}}_{\text{\# of structures, in total, and per type}}
+ \sum_{s \in M} \underbrace{\Big( -\log \Pr(x(s) \mid M) + L(s) \Big)}_{\text{per structure, in order, type and details}} .
\]

First, we transmit the total number of structures in $M$ using $L_{\mathbb{N}}$, the MDL optimal encoding for
integers greater than or equal to 1 [Ris83]. Next, by an index over a weak number composition,
we optimally encode the number of structures of each type $x \in \Omega$ in model $M$. Then, for each
structure $s \in M$, we encode its type $x(s)$ with an optimal prefix code [ 1T06], and finally its
structure.
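The three parts of the model cost described above can be sketched as follows. This is an illustration under stated assumptions rather than the reference implementation: the integer code is taken to be Rissanen's universal code with constant 2.865064, the total is assumed to be $L_{\mathbb{N}}(|M| + 1)$ plus the log-count of weak compositions plus the per-structure terms, the type probabilities are assumed to be empirical frequencies within the model, and the per-structure costs $L(s)$ are assumed to be given. The function names are hypothetical.

```python
import math
from collections import Counter

def universal_int_bits(z):
    """Length in bits of Rissanen's universal code for an integer z >= 1."""
    if z < 1:
        raise ValueError("defined for integers >= 1 only")
    bits = math.log2(2.865064)  # normalizing constant c0
    term = math.log2(z)
    while term > 0:             # log z + log log z + ..., positive terms only
        bits += term
        term = math.log2(term)
    return bits

def model_bits(structure_types, structure_bits, vocabulary):
    """Cost of an ordered list of structures.

    structure_types[i] is the type label of structure i and
    structure_bits[i] its precomputed cost L(s) (an assumed input).
    """
    k = len(structure_types)
    num_types = len(vocabulary)
    # Number of structures; the +1 lets the empty model be encoded.
    bits = universal_int_bits(k + 1)
    # Index over all weak compositions of k into num_types parts.
    bits += math.log2(math.comb(k + num_types - 1, num_types - 1))
    # Per structure: its type via a prefix code, then its details.
    counts = Counter(structure_types)
    for x, l_s in zip(structure_types, structure_bits):
        bits += -math.log2(counts[x] / k) + l_s
    return bits

vocabulary = ["fc", "nc", "fb", "nb", "ch", "st"]
empty_cost = model_bits([], [], vocabulary)
two_stars_cost = model_bits(["st", "st"], [10.0, 12.0], vocabulary)
```

When all structures share one type, the type code costs nothing, so `two_stars_cost` consists of the structure count, the composition index and the 22 bits of structure details.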

To  compute  the  encoded  length  of  a  model,  we  need  to  define  L(s)  per  graph  structure  type  in 
our  vocabulary. 


Cliques 


To encode a set of fully-connected nodes as a full clique, we first encode the number
of nodes, and then their IDs

\[
L(fc) = \underbrace{L_{\mathbb{N}}(|fc|)}_{\text{\# of nodes}} + \underbrace{\log \binom{n}{|fc|}}_{\text{node IDs}} .
\]

For the number of nodes we re-use $L_{\mathbb{N}}$, and we encode their IDs by an index over an ordered
enumeration of all possible ways to select $|fc|$ nodes out of n. As $M$ generalizes the graph, we do
not require that $fc$ is a full clique in G. If only few edges are missing, it may still be convenient to
describe it as such. Every missing edge, however, adds to the cost of transmitting $\mathbf{E}$.
Less dense or near-cliques can be as interesting as full-cliques. We encode these as follows

\[
L(nc) = \underbrace{L_{\mathbb{N}}(|nc|)}_{\text{\# of nodes}} + \underbrace{\log \binom{n}{|nc|}}_{\text{node IDs}}
+ \underbrace{\log(|area(nc)|)}_{\text{\# of edges}} + \underbrace{\|nc\|\, l_1 + \|nc\|'\, l_0}_{\text{edges}} .
\]

We first transmit the number and IDs of nodes as above, and then identify which edges are present
and which are not, using optimal prefix codes. We write $\|nc\|$ and $\|nc\|'$ for the number
of present and missing edges in $area(nc)$, respectively. Then, $l_1 = -\log\big(\|nc\| / (\|nc\| + \|nc\|')\big)$, and
analogously $l_0$, are the lengths of the optimal prefix codes for present and missing edges, respectively. The
intuition is that the more dense (sparse) a near-clique is, the cheaper encoding its edges will be.
Note that this encoding is exact; no edges are added to $\mathbf{E}$.
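To make the density intuition concrete, the sketch below computes only the edge-listing part of the near-clique cost, i.e., the number of present edges times $l_1$ plus the number of missing edges times $l_0$. The function name and the example counts (a 10-node near-clique with 45 node pairs) are hypothetical.

```python
import math

def prefix_code_edge_bits(present, missing):
    """Bits to list `present` ones and `missing` zeros with optimal prefix codes."""
    total = present + missing
    bits = 0.0
    if present:
        bits += present * -math.log2(present / total)  # present edges, l1 each
    if missing:
        bits += missing * -math.log2(missing / total)  # missing edges, l0 each
    return bits

# Three hypothetical 10-node near-cliques (45 node pairs each).
dense = prefix_code_edge_bits(40, 5)
half = prefix_code_edge_bits(22, 23)
sparse = prefix_code_edge_bits(5, 40)
full = prefix_code_edge_bits(45, 0)
```

The dense and the sparse cases cost the same number of bits, both well below the roughly one bit per node pair of the half-full case, and a full clique needs no edge bits at all.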


Bipartite  Cores 

Bipartite cores are defined as non-empty, non-intersecting sets of nodes, $A$ and $B$, for which there
are edges only between the sets $A$ and $B$, and not within.

The encoded length of a full bipartite core $fb$ is

\[
L(fb) = \underbrace{L_{\mathbb{N}}(|A|) + L_{\mathbb{N}}(|B|)}_{\text{cardinality of A and B, resp.}}
+ \underbrace{\log \binom{n}{|A|}}_{\text{node IDs in A}} + \underbrace{\log \binom{n - |A|}{|B|}}_{\text{node IDs in B}} ,
\]

where we encode the size of $A$, $B$, and then the node IDs.

Analogous to cliques, we also consider near bipartite cores, $nb$, where the core is not (necessarily)
fully connected. To encode a near bipartite core we have

\[
L(nb) = \underbrace{L_{\mathbb{N}}(|A|) + L_{\mathbb{N}}(|B|)}_{\text{cardinality of A and B, resp.}}
+ \underbrace{\log \binom{n}{|A|}}_{\text{node IDs in A}} + \underbrace{\log \binom{n - |A|}{|B|}}_{\text{node IDs in B}}
+ \underbrace{\log(|area(nb)|)}_{\text{number of edges}} + \underbrace{\|nb\|\, l_1 + \|nb\|'\, l_0}_{\text{edges}} .
\]


Stars 

A star is a specific case of the bipartite core that consists of a single node (hub) in $A$ connected to a
set $B$ of at least 2 nodes (spokes). For $L(st)$ of a given star $st$ we have

\[
L(st) = \underbrace{L_{\mathbb{N}}(|st| - 1)}_{\text{number of spokes}} + \underbrace{\log n}_{\text{id of hub node}}
+ \underbrace{\log \binom{n - 1}{|st| - 1}}_{\text{ids of spoke nodes}} ,
\]

where $|st| - 1$ is the number of spokes of the star. To identify the member nodes, we first identify
the hub out of n nodes, and then the spokes from the remaining nodes.

Chains 

A chain is a list of nodes such that every node has an edge to the next node, i.e. under the right
permutation of nodes, $\mathbf{A}$ has only the super-diagonal elements (directly above the diagonal)
non-zero. As such, for the encoded length $L(ch)$ for a chain $ch$ we have

\[
L(ch) = \underbrace{L_{\mathbb{N}}(|ch| - 1)}_{\text{\# of nodes in chain}} + \underbrace{\sum_{i=0}^{|ch|} \log(n - i)}_{\text{node IDs, in order of chain}} ,
\]

where we first encode the number of nodes in the chain, and then their IDs in order. Note that
$\sum_{i=0}^{|ch|} \log(n - i) < |ch| \log n$, and hence, by MDL, encoding the IDs in order is the better
(i.e., more efficient) of the two ways to encode the member nodes of a chain.
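The node-identification costs of the exact structure types can be sketched side by side. The function names are hypothetical, the integer code is assumed to be Rissanen's universal code with constant 2.865064, and all logarithms are base 2. One deliberate deviation is labeled in the code: the chain cost sums over exactly $|ch|$ node IDs (i = 0, ..., |ch| - 1), which may differ by one term from the upper limit printed in the chain formula.

```python
import math

def universal_int_bits(z):
    """Rissanen's universal code length for an integer z >= 1 (assumed form)."""
    bits = math.log2(2.865064)
    term = math.log2(z)
    while term > 0:
        bits += term
        term = math.log2(term)
    return bits

def log2_binom(n, k):
    """Bits for an index over all ways to choose k items out of n."""
    return math.log2(math.comb(n, k))

def full_clique_bits(n, size):
    # Number of nodes, then which nodes.
    return universal_int_bits(size) + log2_binom(n, size)

def full_bipartite_core_bits(n, size_a, size_b):
    # Sizes of both sides, IDs of side A, then IDs of side B among the rest.
    return (universal_int_bits(size_a) + universal_int_bits(size_b)
            + log2_binom(n, size_a) + log2_binom(n - size_a, size_b))

def star_bits(n, size):
    # Number of spokes, the hub among n nodes, the spokes among the other n - 1.
    spokes = size - 1
    return universal_int_bits(spokes) + math.log2(n) + log2_binom(n - 1, spokes)

def chain_bits(n, size):
    # Assumption: one ordered ID per chain node, each chosen among the nodes
    # not yet used, so the sum runs over i = 0 .. size - 1.
    ordered_ids = sum(math.log2(n - i) for i in range(size))
    return universal_int_bits(size - 1) + ordered_ids

n = 10
costs = {
    "fc": full_clique_bits(n, 4),
    "fb": full_bipartite_core_bits(n, 2, 2),
    "st": star_bits(n, 4),
    "ch": chain_bits(n, 4),
}
```

For four nodes out of ten, the chain is the most expensive to identify because the order of its nodes matters, while the clique is the cheapest because all of its nodes play the same role.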

3.2.3  Encoding  the  Errors 

Next, we discuss how we encode the errors made by $M$ with regard to $\mathbf{A}$, and store this information
in the error matrix $\mathbf{E}$. Many different approaches exist for encoding the errors; one that is
appealing at first glance is to simply identify all node pairs. However, it is important to realize that
the more efficient our encoding is, the less spurious 'structure' will be discovered.

We hence follow [ ] and encode $\mathbf{E}$ in two parts, $\mathbf{E}^+$ and $\mathbf{E}^-$. The former corresponds to
the area of $\mathbf{A}$ that $M$ does model, and for which $\mathbf{M}$ includes superfluous edges. Analogously, $\mathbf{E}^-$
consists of the area of $\mathbf{A}$ not modeled by $M$, for which $\mathbf{M}$ lacks edges. We encode these separately
as they are likely to have different error distributions. Note that since we know that near cliques
and near bipartite cores are encoded exactly, we ignore these areas in $\mathbf{E}^+$. We encode the edges in
$\mathbf{E}^+$ and $\mathbf{E}^-$ similarly to how we encode near-cliques, and have

\[
L(\mathbf{E}^+) = \log(|\mathbf{E}^+|) + \|\mathbf{E}^+\|\, l_1 + \|\mathbf{E}^+\|'\, l_0
\]
\[
L(\mathbf{E}^-) = \underbrace{\log(|\mathbf{E}^-|)}_{\text{\# of edges}} + \underbrace{\|\mathbf{E}^-\|\, l_1 + \|\mathbf{E}^-\|'\, l_0}_{\text{edges}} .
\]

That is, we first encode the number of 1s in $\mathbf{E}^+$ (respectively $\mathbf{E}^-$), after which we transmit the
1s and 0s using optimal prefix codes of length $l_1$ and $l_0$. We choose to use prefix codes over a
binomial for practical reasons, as prefix codes allow us to easily and efficiently calculate accurate
local gain estimates in our algorithm, without sacrificing much encoding efficiency (typically < 1
bit in practice).
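The two-part error encoding can be sketched as below. Several choices are assumptions made for illustration: the modeled area is passed in as a set of node pairs, the size of each part is taken to be the number of node pairs it covers (so that the first term is the log of that count, by analogy with the near-clique encoding), and only the upper triangle is counted because the graph is undirected. The toy graph and the one-clique model are hypothetical.

```python
import math

def prefix_code_bits(ones, zeros):
    """Bits to transmit `ones` 1s and `zeros` 0s with optimal prefix codes."""
    total = ones + zeros
    bits = 0.0
    for count in (ones, zeros):
        if count:
            bits += count * -math.log2(count / total)
    return bits

def error_bits(A, M_approx, modeled):
    """Return (bits for the modeled part, bits for the unmodeled part)."""
    n = len(A)
    counts = {True: [0, 0], False: [0, 0]}  # per part: [zeros, ones]
    for i in range(n):
        for j in range(i + 1, n):           # undirected: upper triangle only
            err = A[i][j] ^ M_approx[i][j]
            counts[(i, j) in modeled][err] += 1

    def part_bits(zeros, ones):
        cells = zeros + ones
        header = math.log2(cells) if cells else 0.0  # number of 1s in the part
        return header + prefix_code_bits(ones, zeros)

    return part_bits(*counts[True]), part_bits(*counts[False])

# Hypothetical toy graph and model: a full clique on nodes {0, 1, 2, 3}.
n = 5
A = [[0] * n for _ in range(n)]
for u, v in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]:
    A[u][v] = A[v][u] = 1
clique = [0, 1, 2, 3]
modeled = {(i, j) for i in clique for j in clique if i < j}
M_approx = [[1 if i != j and i in clique and j in clique else 0
             for j in range(n)] for i in range(n)]

bits_plus, bits_minus = error_bits(A, M_approx, modeled)
```

In this example the modeled part holds two superfluous edges among six node pairs, and the unmodeled part holds one missing edge among four node pairs.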
Size of the Search Space. Clearly, for a graph of n nodes, the search space $\mathcal{M}$ we have to consider
for solving the Minimum Graph Description Problem is enormous, as it consists of all possible
permutations of the collection $\mathcal{C}$ of all possible structures over the vocabulary $\Omega$. Unfortunately,
it does not exhibit trivial structure, such as (weak) (anti)monotonicity, that we could exploit for
efficient search. Further, Miettinen and Vreeken [ ] showed that for a directed graph finding
the MDL optimal model of only full-cliques is NP-hard. Hence, we resort to heuristics.


3.3  VoG:  Summarization  Algorithm 

Now that we have the arsenal of graph encoding based on the vocabulary of structure types, $\Omega$,
we move on to the next two key ingredients: finding good candidate structures, i.e., instantiating
$\mathcal{C}$, and then mining informative graph summaries, i.e., finding the best model $M$. An illustration
of the algorithm is given in Figure 3.3. The pseudocode of VoG is given in Algorithm 3.1, and the
code is available for research purposes at www.cs.cmu.edu/~dkoutra/SRC/VoG.tar.


3.3.1  Step  1:  Subgraph  Generation 

Any combination of clustering and community detection algorithms can be used to decompose the
graph into subgraphs, which need not be disjoint. These techniques include, but are not limited to,
Cross-associations [ ]ZB+04], Subdue [CH94], SlashBurn [LKF14], Eigenspokes [PSS+10], and
METIS [KK99].

3.3.2  Step  2:  Subgraph  Labeling 

Given a subgraph from the set of clusters or communities discovered in the previous step, we
search for the structure $x \in \Omega$ that best characterizes it, with no or some errors (e.g., perfect
clique, or clique with some missing edges, encoded as error).

Step  2.1:  Labeling  Perfect  Structures 

First, the subgraph is tested against the vocabulary structure types (full clique, full bipartite core,
star and chain) for an error-free match. The test for clique or chain is based on its degree distribution.
Specifically, if all the nodes in the subgraph of size n have degree n - 1, then it is a clique. Similarly,
if all the nodes have degree 2 except for two nodes with degree 1, the subgraph is a chain. On the
other hand, a subgraph is bipartite if the magnitudes of its maximum and minimum eigenvalues
are equal. To find the node IDs in the two node sets, A and B, we use Breadth First Search (BFS)
with node coloring. We note that a star is a special case of a bipartite graph, where one set consists
of only one node. If one of the node sets has size 1, then the given substructure is encoded as a star.
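The error-free tests above can be sketched as follows. This is an illustrative simplification, not the VoG code: the subgraph is assumed to be connected, BFS 2-coloring is used for both the bipartiteness test and the node sets (replacing the eigenvalue test, with which it agrees on connected graphs), and the order in which the types are tried is an assumption, which matters for small ambiguous cases such as a 3-node path. The function names are hypothetical.

```python
from collections import deque

def two_color(adj):
    """BFS 2-coloring; returns the two node sets, or None if the graph
    contains an odd cycle or is disconnected."""
    start = next(iter(adj))
    color = {start: 0}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in color:
                color[v] = 1 - color[u]
                queue.append(v)
            elif color[v] == color[u]:
                return None              # odd cycle: not bipartite
    if len(color) != len(adj):
        return None                      # some node was never reached
    side_a = {v for v, c in color.items() if c == 0}
    return side_a, set(adj) - side_a

def label_perfect_structure(nodes, edges):
    """Return 'fc', 'st', 'fb', 'ch', or None if no exact type matches."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    n = len(adj)
    if n < 2:
        return None
    degrees = sorted(len(nbrs) for nbrs in adj.values())
    if degrees[0] == n - 1:              # every node has degree n - 1
        return "fc"
    sides = two_color(adj)
    if sides is None:
        return None
    side_a, side_b = sides
    num_edges = sum(degrees) // 2
    if num_edges == len(side_a) * len(side_b):   # all cross pairs connected
        return "st" if min(len(side_a), len(side_b)) == 1 else "fb"
    if degrees == [1, 1] + [2] * (n - 2):        # two ends, the rest degree 2
        return "ch"
    return None

clique = label_perfect_structure(range(4), [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)])
chain = label_perfect_structure(range(4), [(0, 1), (1, 2), (2, 3)])
star = label_perfect_structure(range(4), [(0, 1), (0, 2), (0, 3)])
core = label_perfect_structure(range(4), [(0, 2), (0, 3), (1, 2), (1, 3)])
other = label_perfect_structure(range(4), [(0, 1), (0, 2), (1, 2), (2, 3)])
```

A subgraph that matches none of the exact types, such as the triangle with a pendant node in the last call, returns `None` and would be passed on to the approximate labeling step.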
Figure 3.3: Illustration of VoG step-by-step.


Step  2.2:  Labeling  Approximate  Structures 

If  the  subgraph  does  not  have  a  “perfect”  structure  (i.e.,  it  is  not  a  full  clique,  full  bipartite  core, 
star,  or  chain),  the  search  continues  for  the  vocabulary  structure  type  that,  in  MDL  terms,  best 
approximates  the  subgraph.  To  this  end,  we  encode  the  subgraph  as  each  of  the  6  candidate 
vocabulary  structures,  and  choose  the  structure  that  has  the  lowest  encoding  cost. 

Let m* be the graph model with only one subgraph encoded as structure x ∈ Ω (e.g., clique) and
the additional edges included in the error matrix. For reasons of efficiency, instead of calculating
the full cost L(G, m*) as the encoding cost of each subgraph representation, we estimate the local
encoding cost $L(m^*) + L(E^+_{m^*}) + L(E^-_{m^*})$, where $E^+_{m^*}$ and $E^-_{m^*}$ encode the incorrectly
modeled and unmodeled edges, respectively (Section 3.2). The challenge of this step is to efficiently
identify the role of each node in the subgraph (e.g., hub/spoke in a star, member of set A or B in a
near-bipartite core, order of nodes in chain) for the MDL representation. We elaborate on each
structure next.


•  Clique: This representation is straightforward, as all the nodes have the same structural
role. All the nodes are members of the clique or the near-clique. For the full clique, the
missing edges are stored in a local error matrix, $E_{fc}$, in order to obtain an estimate of the
global encoding cost $L(fc) + L(E^+_{fc}) + L(E^-_{fc})$. For near-cliques we ignore $E_{nc}$, so the
encoding cost is $L(nc)$.

•  Star: Representing a given subgraph as a near-star is straightforward as well. We find the
highest-degree node (in case of a tie, we choose one randomly), set it to be the hub of
the star, and identify the rest of the nodes as the peripheral nodes, which are also referred
to as spokes. The additional or missing edges are stored in the local error matrix, $E_{st}$. The
MDL cost of this encoding is computed as $L(st) + L(E^+_{st}) + L(E^-_{st})$.

•  Bipartite core: In this case, the problem of identifying the role of each node reduces
to finding the maximum bipartite graph, which is known as the max-cut problem, and
is NP-hard. The need for a scalable graph summarization algorithm makes us resort to
approximation algorithms. In particular, finding the maximum bipartite graph can be
reduced to semi-supervised classification. We consider two classes which correspond to
the two node sets, A and B, of the bipartite graph, and the prior knowledge is that the
highest-degree node belongs to A, and its neighbors to B. To propagate these classes/labels,
we employ Fast Belief Propagation (FaBP in Chapter 4 and [KKK+11]) assuming heterophily
(i.e., connected nodes belong to different classes). For near-bipartite cores $L(E^+_{nb})$ is omitted.

•  Chain: Representing the subgraph as a chain reduces to finding the longest path in it, which
is also NP-hard. We, therefore, employ the following heuristic. Initially, we pick a node of
the subgraph at random, and find its furthest node using BFS (temporary start). Starting
from the latter and by using BFS again, we find the subsequent furthest node (temporary
end). We then extend the chain by local search. Specifically, we consider the subgraph from
which all the nodes that already belong to the chain, except for its endpoints, are removed.
Then, starting from the endpoints we employ BFS again. If new nodes are found during this
step, they are added to the chain (rendering it a near-chain with few loops). The nodes of
the subgraph that are not members of this chain are encoded as error in $E_{ch}$.
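
The chain heuristic from the list above can be sketched as follows. This is one possible reading rather than the actual VoG code: the function names are hypothetical, the subgraph is assumed to be connected and given as a dictionary of neighbor sets, and the local-search extension is interpreted as appending, at each endpoint, the BFS path to the furthest node reachable through nodes that are not yet in the chain.

```python
import random
from collections import deque


def bfs_farthest(adj, source, allowed=None):
    """BFS from source; new nodes are restricted to `allowed` when it is given.

    Returns (a node at maximum BFS distance from source, BFS parent map).
    """
    parent = {source: None}
    queue = deque([source])
    last = source
    while queue:
        u = queue.popleft()
        last = u  # nodes leave the queue in non-decreasing distance order
        for v in adj[u]:
            if v not in parent and (allowed is None or v in allowed):
                parent[v] = u
                queue.append(v)
    return last, parent


def path_from_source(parent, node):
    """Follow parent pointers and return the path source -> ... -> node."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path[::-1]


def approximate_chain(adj, seed=0):
    rng = random.Random(seed)
    first = rng.choice(sorted(adj))            # random initial node
    start, _ = bfs_farthest(adj, first)        # temporary start
    end, parent = bfs_farthest(adj, start)     # temporary end
    chain = path_from_source(parent, end)

    # Local search: extend each endpoint through nodes outside the chain.
    free = set(adj) - set(chain)
    tail, parent = bfs_farthest(adj, chain[-1], allowed=free)
    extension = path_from_source(parent, tail)[1:]
    chain += extension
    free -= set(extension)
    head, parent = bfs_farthest(adj, chain[0], allowed=free)
    extension = path_from_source(parent, head)[1:]
    return extension[::-1] + chain
```

Nodes of the subgraph that do not appear in the returned list are the ones that would be charged to the error matrix.
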


After representing the subgraph as each of the vocabulary structures x, we employ MDL to
choose the representation with the minimum (local) encoding cost, and add the structure to
the candidate set, C. Finally, we associate the candidate structure with its encoding benefit: the
savings in bits for encoding the subgraph by the minimum-cost structure type, instead of leaving
its edges unmodeled and including them in the error matrix.

Algorithm 3.1 VoG
Input: graph G

Step 1: Subgraph Generation. Generate candidate (possibly overlapping) subgraphs using
one or more graph decomposition methods.

Step 2: Subgraph Labeling. Characterize the type x ∈ Ω of each subgraph using MDL: identify
the type x as the one that minimizes the local encoding cost. Populate the candidate set C
accordingly.

Step 3: Summary Assembly. Use the heuristics Plain, Top10, Top100, and Greedy’nForget
(Section 3.3.3) to select a non-redundant subset from the candidate structures to instantiate the
graph model M. Pick the model of the heuristic with the lowest description cost.

Return: graph summary M and its encoding cost.


3.3.3  Step  3:  Summary  Assembly 

Given a set of candidate structures, C, how can we efficiently induce the model M that is the best
graph summary? The exact selection algorithm, which considers all the possible ordered combinations
of the candidate structures and chooses the one that minimizes the cost, is combinatorial,
and cannot be applied to any non-trivial candidate set. Thus, we need heuristics that will give a
fast, approximate solution to the description problem. To reduce the search space of all possible
permutations, we attach a quality measure to each candidate structure, and consider them in order
of decreasing quality. The measure that we use is the encoding benefit of the subgraph, which, as
mentioned before, is the number of bits that are gained by encoding the subgraph as structure x
instead of noise. Our constituent heuristics are:

•  Plain:  The  baseline  approach  gives  all  the  candidate  structures  as  graph  summary,  i.e., 
M  =  C. 

•  Top-k:  Selects  the  top-k  candidate  structures  as  sorted  according  to  decreasing  quality. 

•  Greedy’nForget (GnF): Considers each structure in C sequentially, sorted by descending
quality, and iteratively includes each in M: as long as the total encoded cost of the graph
does not increase, it keeps the structure in M; otherwise, it removes it. Greedy’nForget
continues this process until all the structures in C have been considered. This heuristic is
more computationally demanding than the Plain or Top-k heuristics, but still handles large
sets of candidate structures efficiently.

VoG  employs  all  the  heuristics  and  by  MDL  picks  the  overall  best  graph  summarization,  or 
equivalently,  the  summarization  with  the  minimum  description  cost. 
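
The three heuristics and the final MDL-based choice can be sketched as follows. The sketch relies on a deliberately simplified cost function: `total_cost` is a toy stand-in for L(G, M) that charges a fixed, made-up number of bits per unexplained edge, and `Candidate`, `benefit`, and `summarize` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    edges: frozenset  # edges of the graph that the structure explains
    cost: float       # bits needed to describe the structure itself


BITS_PER_ERROR_EDGE = 10.0  # toy constant, not VoG's actual error encoding


def total_cost(model, all_edges):
    """Toy stand-in for L(G, M): structure costs plus unexplained edges."""
    explained = set().union(*(c.edges for c in model))
    return sum(c.cost for c in model) + BITS_PER_ERROR_EDGE * len(all_edges - explained)


def benefit(c):
    """Bits saved by encoding c's edges as a structure instead of as noise."""
    return BITS_PER_ERROR_EDGE * len(c.edges) - c.cost


def plain(candidates):
    return list(candidates)


def top_k(candidates, k):
    return sorted(candidates, key=benefit, reverse=True)[:k]


def greedy_n_forget(candidates, all_edges):
    model, best = [], total_cost([], all_edges)
    for cand in sorted(candidates, key=benefit, reverse=True):
        trial = total_cost(model + [cand], all_edges)
        if trial <= best:  # the cost did not increase: keep the structure
            model, best = model + [cand], trial
    return model


def summarize(candidates, all_edges, ks=(10, 100)):
    """Run every heuristic and return the model with the lowest total cost."""
    models = [plain(candidates), greedy_n_forget(candidates, all_edges)]
    models += [top_k(candidates, k) for k in ks]
    return min(models, key=lambda m: total_cost(m, all_edges))
```

In this toy setting, a candidate whose edges are already explained by a higher-ranked structure only adds its own description cost, so Greedy’nForget drops it, which mirrors the behavior described above.
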

3.3.4  Toy  Example 

To illustrate how VoG works, we give an example on a toy graph. We apply VoG on the synthetic
Caveman graph of 841 nodes and 7 547 edges which, as shown in Figure 3.4, consists of two
cliques separated by two stars. The leftmost and rightmost cliques consist of 42 and 110 nodes,
respectively; the big star (2nd structure) has 800 nodes, and the small star (3rd structure) 91
nodes. Here is how VoG works step-by-step:

•  Step 1: The raw output of the decomposition algorithm consists of the subgraphs
corresponding to the stars, the full left-hand and right-hand cliques, as well as the subsets of
these nodes.

•  Step  2:  Through  MDL,  VoG  correctly  identifies  the  type  of  these  structures. 

•  Step  3:  Finally,  via  Greedy’nForget,  it  automatically  finds  the  four  true  structures  without 
redundancy,  and  drops  the  structures  that  consist  of  subsets  of  nodes. 

The corresponding model requires 36% fewer bits than the ‘empty’ model, where the graph
edges are encoded as noise. We note that a gain of even one bit already corresponds to a model
that is twice as likely.
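
The link between bits and likelihood follows from the usual correspondence between code lengths and probabilities. As a short worked step, assuming optimal codes, and with $P_1$, $P_2$, $L_1$, $L_2$ as illustrative symbols for the probabilities and code lengths that two competing models assign to the same data $x$:

```latex
L_i(x) = -\log_2 P_i(x)
\quad\Longrightarrow\quad
\frac{P_1(x)}{P_2(x)} = 2^{\,L_2(x) - L_1(x)},
\qquad\text{so}\qquad
L_2(x) - L_1(x) = 1 \text{ bit}
\;\Longrightarrow\;
P_1(x) = 2\,P_2(x).
```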


Figure 3.4: Toy graph: VoG saves 36% in space, by successfully discovering the two cliques and two
stars that we chained together.


3.3.5  Time  Complexity  of  VoG 

For a graph G(V, E) of n = |V| nodes and m = |E| edges, the time complexity of VoG depends on
the  runtime  complexity  of  the  algorithms  that  compose  it,  namely  the  decomposition  algorithm, 
the  subgraph  labeling,  the  encoding  scheme  L(G,M)  of  the  model,  and  the  structure  selection 
(summary  assembly). 

For the decomposition of the graph, we use SlashBurn, which is near-linear in the number of
edges of real graphs [  ]. The subgraph labeling algorithms in Section 3.3.2 are carefully designed to
be linear in the number of edges of the input subgraph.

When there is no overlap between the structures in M, the complexity of calculating the
encoding scheme L(G, M) is O(m). When there is some overlap, the complexity is higher: assume
that s, t are two structures ∈ M with overlap, and t has higher quality than s, i.e., t comes before s
in the ordered list of structures. Finding how much ‘new’ structure (or area in A) s explains relative
to t costs O(|M|^2). Thus, in the case of overlapping subgraphs, the complexity of computing the
encoding scheme is O(|M|^2 + m). As typically |M| ≪ m, in practice we have O(m).

As far as the selection method is concerned, the Top-k heuristic that we propose has complexity
O(k). The Greedy’nForget heuristic has runtime O(|C| × o × m), where |C| is the number of
structures identified by VoG, and o the time complexity of L(G, M).


3.4  Experiments 

In  this  section,  we  aim  to  answer  the  following  questions: 

Q1.  Are the real graphs structured, or random and noisy? If the graphs are structured, can their
structures  be  discovered  under  noise? 

Q2.  What  structures  do  the  graph  summaries  contain,  and  how  can  they  be  used  for  understanding? 
Q3.  Is  VoG  scalable  and  able  to  efficiently  summarize  large  graphs? 

The  graphs  we  use  in  the  experiments  along  with  their  descriptions  are  summarized  in  Table  3.2. 
Liancourt-Rocks  is  a  co-editor  graph  on  a  controversial  Wikipedia  article  about  the  island 
Liancourt  Rocks,  where  the  nodes  are  users,  and  the  edges  mean  that  they  edited  the  same 
sentence.  Chocolate  is  a  co-editor  graph  on  the  ‘Chocolate’  article.  The  descriptions  of  the  other 
datasets  are  given  in  Table  3.2. 

Table 3.2: VoG: Summary of graphs used.

Name                         Nodes      Edges        Description
Flickr [fli]                 404 733    2 110 078    Friendship social network
WWW-Barabasi [SNA]           325 729    1 090 108    WWW in nd.edu
Epinions [SNA]               75 888     405 740      Trust graph
Enron [  ni  ]               80 163     288 364      Enron email
AS-Oregon [  IS(  ]          13 579     37 448       Router connections
Wikipedia-Liancourt-Rocks    1 005      2 123        Co-edit graph
Wikipedia-Chocolate          2 899      5 467        Co-edit graph

Graph  Decomposition.  In  our  experiments,  we  modify  SlashBurn  [LKF1  ],  a  node  reordering 
algorithm,  to  generate  candidate  subgraphs.  The  reasons  we  use  SlashBurn  are  (a)  it  is  scalable, 
and  (b)  it  is  designed  to  handle  graphs  without  caveman  structure.  We  note  that  VoG  would  only 
benefit  from  using  the  outputs  of  additional  decomposition  algorithms. 

SlashBurn is an algorithm that reorders the nodes so that the resulting adjacency matrix
has clusters or patches of non-zero elements. The idea is that removing the top high-degree
nodes in real-world graphs results in the generation of many small-sized disconnected components
(subgraphs), and one giant connected component whose size is significantly smaller compared to
the original graph. Specifically, it performs two steps iteratively: (a) it removes the top high-degree
nodes from the original graph; (b) it reorders the nodes so that the high-degree nodes go to the
front, the disconnected components to the back, and the giant connected component (GCC) to the
middle. During the next iterations, these steps are performed on the giant connected component.
A good node-reordering method will reveal patterns, as well as large empty areas, as shown in
Figure 3.5 on the Wikipedia Chocolate network.


(a)  Original.  (b)  After  re-ordering. 

Figure 3.5: Adjacency matrix before and after node-ordering on the Wikipedia Chocolate graph.
Large empty (and dense) areas appear, aiding the graph decomposition step of VoG and the
discovery of candidate structures.


In this work, SlashBurn is modified to decompose the input graph. In more detail, we focus
on the first step of the algorithm, which removes the high-degree node by “burning” its edges. This
step is depicted for a toy graph in Figure 3.6(b), where the green dotted line shows which edges
were “burnt”. Then, the hub with its egonet, which consists of the hub’s one-hop-away neighbors
and the connections between them, forms the first candidate structure. Moreover, the connected
components with a size greater than or equal to two and smaller than the size of the GCC constitute
additional candidate structures (see Figure 3.6(c)). In the next iteration, the same procedure is
applied to the giant connected component, yielding in this way a set of candidate structures. We use
MDL to determine the best-fitting type per discovered candidate structure.
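
The candidate generation described above can be sketched as follows. This is a simplified illustration and not the modified SlashBurn code itself: it removes a single hub per iteration, the names `candidate_subgraphs` and `min_gcc_size` are hypothetical, the graph is assumed to be a dictionary of neighbor sets, and components that tie with the GCC in size are treated as ordinary components.

```python
from collections import deque


def connected_components(adj, nodes):
    """Connected components of the subgraph induced by the node set `nodes`."""
    seen, components = set(), []
    for source in nodes:
        if source in seen:
            continue
        seen.add(source)
        component, queue = {source}, deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v in nodes and v not in seen:
                    seen.add(v)
                    component.add(v)
                    queue.append(v)
        components.append(component)
    return components


def candidate_subgraphs(adj, min_gcc_size=2):
    """Return node sets of candidate structures (egonets and small components)."""
    active = set(adj)  # nodes of the current giant connected component
    candidates = []
    while len(active) >= min_gcc_size:
        # Hub: the highest-degree node within the active part of the graph.
        hub = max(active, key=lambda u: len(adj[u] & active))
        egonet = {hub} | (adj[hub] & active)
        if len(egonet) >= 2:
            candidates.append(egonet)
        active.remove(hub)  # "burn" the hub and its edges
        components = connected_components(adj, active)
        if not components:
            break
        gcc = max(components, key=len)
        candidates.extend(c for c in components if c is not gcc and len(c) >= 2)
        active = gcc  # the next iteration works on the GCC only
    return candidates
```

Each returned node set would then be labeled with its best-fitting structure type, as in Step 2.
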


(a) Initial toy graph. (b) SlashBurn on the toy graph. (c) Candidate structures (in circles).

Figure 3.6: Illustration of the graph decomposition and the generation of the candidate structures.

3.4.1  Q1:  Quantitative  Analysis

In  this  section  we  apply  VoG  to  the  real  datasets  of  Table  3.2,  and  evaluate  the  achieved  description 
cost,  and  edge  coverage,  which  are  indicators  of  the  discovered  structures.  The  evaluation  is  done 
in  terms  of  savings  with  respect  to  the  base  encoding  (Original)  of  the  adjacency  matrix  of  a 
graph  with  an  empty  model  M.  Moreover,  by  using  synthetic  datasets,  we  evaluate  the  ability 
of  VoG  to  discover  the  ground  truth  graph  structures  under  the  presence  of  noise.  Finally,  we 
discuss  the  selection  of  the  vocabulary  for  summarization,  and  compare  the  different  selections 
quantitatively  in  terms  of  the  encoding  cost  of  the  summaries  that  they  generate. 


Description  Cost 

Although we refer to the description cost of the summarization techniques, we note that
compression itself is not our goal, but our means for identifying the structures important for graph
understanding or attention routing³. This is also why it does not make sense to compare VoG
with standard matrix compression techniques: whereas VoG has the goal of describing a graph
with simple and easy-to-understand structures, specialized algorithms may exploit any statistical
correlations to save bits.

We compare three summarization approaches: (a) Original: The whole adjacency matrix
is encoded as if it contains no structure; that is, M = ∅, and A is encoded through $L(E^-)$; (b)
SB+nc: All the subgraphs extracted by our method (first step of Algorithm 3.1 using our proposed
variant of SlashBurn) are encoded as near-cliques; and (c) VoG: Our proposed summarization
algorithm with the three selection heuristics (Plain, Top10 and Top100, Greedy’nForget⁴).

We ignore very small structures; the candidate set C includes subgraphs with at least 10 nodes,
except for the Wikipedia graphs where the size threshold is set to 3 nodes. Among the summaries
obtained by the different heuristics, we choose the one that yields the smallest description length.

Table  3.3  presents  the  summarization  cost  of  each  technique  with  respect  to  the  cost  of  the 
Original  approach,  as  well  as  the  fraction  of  the  edges  that  remains  unexplained.  Specifically,  the 
first  column,  Original,  presents  the  cost,  in  bits,  of  encoding  the  adjacency  matrix  with  an  empty 
model  M.  The  second  column,  SB  +  nc,  presents  the  relative  number  of  bits  needed  to  describe 
the  structures  discovered  by  SlashBurn  as  near-cliques.  Then,  for  different  VoG  heuristics  we 
show  the  relative  number  of  bits  needed  to  describe  the  adjacency  matrix.  In  the  last  four  columns, 
we  give  the  fraction  of  edges  that  are  not  explained  by  the  structures  in  the  model,  M,  that  each 
heuristic  finds.  The  lowest  description  cost  is  in  bold. 

³ High compression ratios are exactly a sign that many redundancies (i.e., patterns) that can be explained
in simple terms (i.e., structures) were discovered.

⁴ By carefully designing the Greedy’nForget heuristic to exploit memoization, we are able to efficiently
compute the best structures within the candidate set. Although restricting our search space to a small
number of candidate structures (ranked in decreasing order of quality) can yield faster results, we report
results on the whole search space.

Table 3.3: [Lower is better.] Quantitative comparison of baseline methods and VoG with different
summarization heuristics. The first column, Original, presents the cost, in bits, of encoding the
adjacency matrix with an empty model M. The other columns give the relative number of bits
needed to describe the adjacency matrix.

                                        VoG compression                 VoG unexplained edges
Graph            Original    SB+nc      Plain  Top10  Top100  GnF       Plain  Top10  Top100  GnF
                 (bits)      (% bits)
Flickr           35 210 972  92%        81%    99%    97%     95%       4%     72%    39%     36%
WWW-Barabasi     18 546 330  94%        81%    98%    96%     85%       3%     62%    51%     38%
Epinions         5 775 964   128%       82%    98%    95%     81%       6%     65%    46%     14%
Enron            4 292 729   121%       75%    98%    93%     75%       2%     77%    46%     6%
AS-Oregon        475 912     126%       72%    87%    79%     71%       4%     59%    25%     12%
Chocolate        60 310      127%       96%    96%    93%     88%       4%     70%    35%     27%
Liancourt-Rocks  19 833      138%       98%    94%    96%     87%       5%     51%    12%     31%

The lower the ratios (i.e., the lower the obtained description length), the more structure is
identified. For example, VoG-Plain describes Flickr with only 81% of the bits of the Original
approach, and explains all but 4% of the edges, which means that 4% of the edges are not encoded
by the structures in M.
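
As a quick worked reading of the table (the percentages are rounded, so the result is only approximate), the Flickr entry for VoG-Plain translates into an absolute description length of roughly:

```latex
0.81 \times 35\,210\,972 \text{ bits} \;\approx\; 2.85 \times 10^{7} \text{ bits}.
```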

Observation 1. Real graphs do have structure; VoG, with or without structure selection,
achieves better compression than the Original approach that assumes no structure, as well as
the SB+nc approach that encodes all the subgraphs as near-cliques.

We  observe  that  the  SB  +  nc  approach  often  requires  more  bits  than  the  Original  approach, 
which  assumes  no  structure.  This  is  due  to  the  fact  that  the  discovered  structures  often  overlap 
and,  thus,  some  nodes  are  encoded  multiple  times.  Moreover,  a  near-clique  is  not  the  optimal 
model  for  all  the  extracted  structures;  if  it  were,  VoG-Plain  (which  is  more  expressive  and  allows 
more  models,  such  as  full  clique,  bipartite  core,  star,  and  chain)  would  have  resulted  in  higher 
encoding  cost  than  the  SB  +  nc  method. 

Greedy’nForget finds models M with fewer structures than Plain and Top100 (which is
important for graph understanding and guiding attention to few structures), and often obtains
(much) more succinct graph descriptions. This is due to its ability to identify structures that
are informative with regard to what it already knows. In other words, structures that highly
overlap with ones already selected into M will be much less rewarded than structures that explain
unexplored parts of the graph.


Discovering  structures  under  noise 


In this section we evaluate whether VoG is able to detect the underlying graph structures under
noise. We start with the Caveman graph described in Section 3.3.4 (original graph), and generate
noisy instances by reversing

ε ∈ {0.1%, 0.5%, 1%, 5%, 10%, 15%, 20%, 25%, 30%, 35%}

of the original graph’s edges ($E_{orig}$). For example, at noise level ε = 0.1%, we randomly pick
0.1% · |$E_{orig}$| pairs of nodes. If an edge existed between two selected nodes in the original graph, we
remove it in the noisy instance; otherwise, we add it. To evaluate our method’s ability to detect the
underlying structures, we treat the building blocks of the original graph as ground truth structures,
and compute the precision and recall of VoG-Greedy’nForget at different levels of noise. We
define the precision as:


\[
\text{precision} = \frac{\#\text{ of relevant retrieved structures}}{\#\text{ of retrieved structures}},
\]

where  we  consider  a  structure  relevant  if  it  (i)  overlaps  with  a  ground  truth  structure,  and 
(ii)  has  the  same,  or  similar  (full  and  near-clique,  or  full  and  near-bipartite  core)  connectivity 
pattern  to  the  overlapping  ground  truth  structure. 

We  define  recall  as: 


\[
\text{recall} = \frac{\#\text{ of retrieved ground truth structures}}{\#\text{ of relevant ground truth structures}},
\]

where  we  consider  a  ground  truth  structure  retrieved  if  VoG  returned  at  least  one  overlapping 
structure  with  the  same  or  similar  connectivity  pattern. 

In  addition  to  the  precision  and  recall,  we  also  define  the  “weighted  precision”,  which  penalizes 
retrieved  VoG  structures  that  partially  overlap  with  the  ground  truth  structures. 

\[
\text{weighted-precision} = \frac{\sum \text{node-overlap of relevant retrieved structures}}{\#\text{ of retrieved structures}},
\]

where  the  numerator  is  the  sum  of  the  node  overlap  of  relevant  retrieved  structures  and  the 
corresponding  ground  truth  structures. 
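
The three measures can be computed as in the following sketch. Several details are assumptions made for illustration only: structures are represented as (type, node set) pairs, all ground truth structures are treated as relevant, the node overlap is measured with the Jaccard index against the best-matching ground truth structure, and the names `FAMILY` and `evaluate` are hypothetical.

```python
# Structure types that count as having a "similar" connectivity pattern.
FAMILY = {"fc": "clique", "nc": "clique",
          "fb": "bipartite", "nb": "bipartite",
          "st": "star", "ch": "chain"}


def jaccard(a, b):
    return len(a & b) / len(a | b)


def evaluate(retrieved, truth):
    """Return (precision, recall, weighted precision).

    Both arguments are lists of (structure_type, node_set) pairs.
    """
    if not retrieved or not truth:
        return 0.0, 0.0, 0.0
    overlaps = []             # best node overlap per retrieved structure
    retrieved_truth = set()   # indices of ground truth structures that were hit
    for r_type, r_nodes in retrieved:
        best = 0.0
        for i, (t_type, t_nodes) in enumerate(truth):
            if FAMILY[r_type] == FAMILY[t_type] and r_nodes & t_nodes:
                retrieved_truth.add(i)
                best = max(best, jaccard(r_nodes, t_nodes))
        overlaps.append(best)
    relevant = sum(1 for o in overlaps if o > 0)
    precision = relevant / len(retrieved)
    recall = len(retrieved_truth) / len(truth)
    weighted_precision = sum(overlaps) / len(retrieved)
    return precision, recall, weighted_precision
```

With this convention, a retrieved structure that covers only half of the nodes of its ground truth counterpart still counts as a full hit for precision and recall, but contributes only 0.5 to the weighted precision.
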

In  our  experiment,  we  generate  ten  noisy  instances  of  the  original  graph  at  each  noise  level  e. 
In  Figure  3.7,  we  give  the  precision,  recall,  and  weighted  precision  averaged  over  the  10  graph 
instances  at  each  level  of  noise.  The  error  bars  in  the  plot  correspond  to  the  standard  deviation  of 
the  quantities.  In  addition  to  the  accuracy  metrics,  we  provide  the  average  number  of  retrieved 
structures  per  noise  level  in  Figure  3.8. 
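
The noisy instances can be generated along the lines of the following sketch. It is an illustration under stated assumptions: the function name `add_noise` is hypothetical, edges are undirected and stored as frozensets of two nodes, and the randomly picked node pairs are assumed to be distinct, so that no pair is flipped twice.

```python
import random


def add_noise(nodes, edges, eps, seed=None):
    """Flip eps * |E_orig| random node pairs: remove existing edges, add missing ones."""
    rng = random.Random(seed)
    nodes = sorted(nodes)
    noisy = {frozenset(e) for e in edges}
    max_pairs = len(nodes) * (len(nodes) - 1) // 2
    num_flips = min(round(eps * len(noisy)), max_pairs)
    flipped = set()
    while len(flipped) < num_flips:
        pair = frozenset(rng.sample(nodes, 2))
        if pair in flipped:
            continue  # assumption: each node pair is flipped at most once
        flipped.add(pair)
        noisy ^= {pair}  # toggle: drop the edge if present, add it otherwise
    return noisy
```

Calling the function with ten different seeds per noise level would reproduce the setup of ten noisy instances per value of ε.
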

We observe that at all noise levels, VoG has high precision and recall (above 0.85 and 0.75,
respectively). The weighted precision decreases as the noise increases, but it remains high at low
levels of noise (< 5%).

Observation  2.  VoG  routes  attention  to  the  ground  truth  structures  even  under  the  presence  of 
noise. 

The high precision and recall of VoG at levels of noise greater than 5% are due to the large
number of structures that are retrieved (≈20). As evidenced by the weighted precision, the
retrieved structures are relevant to the ground truth structures (they are counted as ‘hits’ by the
definition of precision and recall), but they do not recover the ground truth structures perfectly⁵,
leading to lower levels of weighted precision. On the other hand, for noise below 5%, we observe
a small drop in the precision and recall. Although the retrieved and ground truth structures
are almost equal in number, and have high node overlap, they do not always have the same
connectivity patterns (e.g., a star matches to a near-bipartite core), which leads to a slight decrease
in the precision and recall of VoG.

Figure 3.7: VoG routes attention to the ground truth structures even under the presence of noise.
We show the precision, recall, and weighted-precision of VoG at different levels of noise.

Figure 3.8: Number of structures retrieved under the presence of noise. As the noise increases and
more loosely connected components are created, VoG retrieves more structures.

Discussion  about  the  Vocabulary 

In Section 3.2.1, we introduced the vocabulary we chose for graph summarization, which consists
of six very common structures in real-world graphs [  ,  ,  ]: full and near-cliques (fc, nc), full and
near-bipartite cores (fb, nb), stars (st), and chains (ch). Here we vary the
vocabulary by dropping the near-structures, and re-evaluate VoG in terms of savings with respect
to the Original encoding, which assumes an empty model M. We refer to our method with the
reduced vocabulary as VoG-reduced. The results for the original and reduced vocabulary are
given in Table 3.4. For different VoG and VoG-reduced heuristics we show the relative number
of bits needed to describe the adjacency matrix compared to the original number of bits that is
given in Table 3.3. From the same table we repeat the first four columns in order to make the
comparison easier. The lowest description cost per vocabulary version is given in bold.

⁵ The overlap of the retrieved structures and the ground truth structures is often < 1.

Table 3.4: [Lower is better.] VoG-reduced is comparable to VoG. We give the relative number of
bits needed to describe the adjacency matrix with respect to the original number of bits that is given
in Table 3.3.

                 VoG (with near-structures)      VoG-reduced (without near-structures)
Graph            Plain  Top10  Top100  GnF       Plain  Top10  Top100  GnF
Flickr           81%    99%    97%     95%       171%   99%    97%     85%
WWW-Barabasi     81%    98%    96%     85%       213%   98%    92%     81%
Epinions         82%    98%    95%     81%       82%    98%    95%     81%
Enron            75%    98%    93%     75%       80%    98%    93%     72%
AS-Oregon        72%    87%    79%     71%       74%    87%    76%     70%
Chocolate        96%    96%    93%     88%       92%    94%    92%     92%
Liancourt-Rocks  98%    94%    96%     87%       92%    91%    91%     92%

Overall, for almost all the real datasets, we observe that VoG achieves the same or slightly
lower encoding cost than VoG-reduced. However, if we focus only on the Greedy’nForget
heuristic, we notice that, in four of the seven cases (Flickr, WWW-Barabasi, Enron, AS-Oregon),
the reduced vocabulary results in better compression than the original one. Despite the differences
in  the  encoding  cost  of  VoG,  depending  on  the  vocabulary  that  is  being  considered,  the  results 
are  comparable.  Therefore,  the  selection  of  the  vocabulary  depends  on  the  analyst  and  the 
application  of  interest;  more  vocabulary  terms  allow  for  more  expressivity  and  may  pinpoint 
patterns  with  interesting  semantic  meanings.  However,  if  the  analyst  wants  to  focus  on  a  reduced 
set  of  structures,  she  can  limit  the  vocabulary  accordingly,  and  find  the  “best”  graph  summary 
contingent  on  the  assumed  vocabulary. 


3.4.2  Q2:  Qualitative  Analysis 

In  this  section,  we  showcase  how  to  use  VoG  and  interpret  the  graph  summaries  that  it  outputs. 

Graph  Summaries 

How  well  does  VoG  summarize  real  graphs?  Which  are  the  most  frequent  structures?  Table  3.5 
shows  the  summarization  results  of  VoG  for  different  structure  selection  techniques. 

Observation 3. The summaries of all the selection heuristics consist mainly of stars, followed by
near-bipartite cores. In some graphs, like Flickr and WWW-Barabasi, there is a significant
number of full cliques.

Table 3.5: Summarization of graphs by VoG. The most frequent structures are the stars and
near-bipartite cores. We provide the frequency of each structure type: ‘st’ for star, ‘nb’ for near-bipartite
cores, ‘fc’ for full cliques, ‘fb’ for full bipartite-cores, ‘ch’ for chains, and ‘nc’ for near-cliques.

                 Plain                                Top10       Top100                Greedy’nForget
Graph            st      nb     fc   fb   ch   nc     st   nb     st   nb   fb   ch     st     nb    fc    fb
Flickr           24 385  3 750  281  9    -    3      10   -      99   1    -    -      415    -     -     1
WWW-Barabasi     10 027  1 684  487  120  26   -      9    1      83   14   3    -      4 177  161   328   85
Epinions         5 204   528    13   -    -    -      9    1      99   1    -    -      2 644  -     8     -
Enron            3 171   178    3    11   -    -      9    1      99   1    -    -      1 810  -     2     2
AS-Oregon        489     85     -    4    -    -      10   -      93   6    1    -      399    -     -     -
Chocolate        170     58     -    -    17   -      9    1      87   10   -    3      101    -     -     -
Liancourt-Rocks  73      21     -    1    22   -      8    2      66   17   1    16     39     -     -     -

From  Table  3.5  we  also  observe  that  Greedy’nForget  drops  uninteresting  structures,  and 
reduces  the  graph  summary.  Effectively,  it  filters  out  the  structures  that  explain  edges  already 
explained  by  structures  in  model  M. 

How often do we find perfect cliques, bipartite cores etc., in real graphs? To each structure,
we attach a quality score that quantifies how close the structure that VoG discovered (e.g., a
near-bipartite core) is to the “perfect” structure consisting of the same nodes (e.g., perfect bipartite
core on the same nodes, without any errors). We define the quality score of structure s as:


\[
\text{quality}(s) = \frac{\text{encoding cost of error-free structure} \in \Omega}{\text{encoding cost of discovered structure}}.
\]


The  quality  score  takes  values  between  0  and  1.  A  quality  score  that  tends  to  0  corresponds  to  a 
structure  that  deviates  significantly  from  the  “perfect”  structure,  while  1  means  that  the  discovered 
structure  is  perfect  (e.g.,  error-free  star).  Table  3.6  gives  the  average  quality  of  the  structures 
discovered  by  VoG  in  real  datasets. 
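As a small worked illustration of the quality score, the sketch below computes the ratio for given encoding costs. The function name and the bit counts in the example are hypothetical; the actual encoding costs come from the vocabulary encodings of Section 3.2, which are not reproduced here. The sketch also assumes that the discovered structure never costs fewer bits than its error-free counterpart, which is what keeps the score in [0, 1].

```python
def quality_score(cost_error_free_bits: float, cost_discovered_bits: float) -> float:
    """Ratio of the error-free encoding cost to the discovered structure's cost.

    Both arguments are encoding costs in bits (hypothetical inputs here).
    Assumes the discovered structure costs at least as much as the
    error-free structure on the same nodes, so the result lies in (0, 1].
    """
    if cost_error_free_bits <= 0 or cost_discovered_bits <= 0:
        raise ValueError("encoding costs must be positive")
    if cost_error_free_bits > cost_discovered_bits:
        raise ValueError("error-free cost is assumed not to exceed the discovered cost")
    return cost_error_free_bits / cost_discovered_bits


# Hypothetical costs: a near-star that needs 160 bits (structure plus errors),
# whereas the perfect star on the same nodes would need 120 bits.
near_star_quality = quality_score(120.0, 160.0)   # 0.75
perfect_star_quality = quality_score(50.0, 50.0)  # 1.0, i.e. error-free
```

A score close to 1 means that almost all of the bits go to the structure itself rather than to correcting its errors.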

Table 3.6: Quality of the structures discovered by VoG. For each structure type, we provide the average (and standard deviation) of the quality of the discovered structures.

Graph            st           nb           fc           fb           ch     nc
WWW-Barabasi     0.78 (0.25)  0.77 (0.22)  0.55 (0.17)  0.51 (0.42)  1 (0)  -
Epinions         0.66 (0.27)  0.82 (0.15)  0.50 (0.08)  -            -      -
Enron            0.62 (0.65)  0.85 (0.19)  0.53 (0.02)  1 (0)        -      -
AS-Oregon        0.65 (0.30)  0.84 (0.18)  -            1 (0)        -      -
Chocolate        0.75 (0.20)  0.89 (0.19)  -            -            1 (0)  -
Liancourt-Rocks  0.75 (0.26)  0.94 (0.14)  -            1 (0)        1 (0)  -

By leveraging the MDL principle, VoG can discover not only exact structures, but also approximate structures that have some erroneous edges. In the real datasets that we studied, the chains that VoG discovers do not have any missing or additional edges; this is probably due to the small size of the chains (4 nodes long, on average, for Chocolate and Liancourt-Rocks, and 20 nodes long for WWW-Barabasi). The near-bipartite cores and stars are also of high quality (at least 0.77 and 0.66, respectively). Finally, the discovered cliques are almost half-full, as evidenced by the 0.50-0.55 quality score.

In order to gain a better understanding of the structures that VoG finds, in Figures 3.9 and 3.10 we give the size distributions of the most frequent structures in the WWW-Barabasi web graph, the Enron email network, and the Flickr social network.


Observation  4.  The  size  distribution  of  the  stars  and  near-bipartite  cores  follows  a  power  law. 


The distribution of the size of the full cliques in Flickr follows a power law as well, while the distributions of the full cliques and bipartite cores in WWW-Barabasi do not show any clear pattern. In Figures 3.9 and 3.10, we denote with blue crosses the size distribution of the structures discovered by VoG-Plain, and with red circles the size distribution for the structures found by VoG with the Top100 heuristic.

[Figure 3.9 panels: (a) Stars ("Flickr: Size Distribution of Stars"; x-axis: log(size of star)). (b) Bipartite and near-bipartite cores ("Flickr: Size Distribution of Full and Near-Bipartite Cores"; x-axis: log(size of (near-) bipartite core)).]

Figure 3.9: Flickr: The size of stars, near-bipartite cores and full cliques follows the power law distribution. Distribution of the size of the structures by VoG (blue crosses) and VoG-Top100 (red circles) that are the most informative from an information-theoretic point of view.

Figure 3.10: The distribution of the size of the structures (stars, near-bipartite cores) that are the most ‘interesting’ from the MDL point of view follows the power law distribution both in the WWW-Barabasi web graph (top) and the Enron email network (bottom). The distributions of structures discovered by VoG and VoG-Top100 are denoted by blue crosses and red circles, respectively.
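One rough way to check a power-law claim such as Observation 4 is to fit a straight line to the size histogram on log-log axes and read off the slope. The sketch below does this with an ordinary least-squares fit using only the standard library. The function name and the synthetic histogram are assumptions made for illustration and are not taken from the datasets above; a least-squares fit on log-log axes is only a quick diagnostic, not a rigorous test.

```python
import math
from collections import Counter


def fit_power_law_exponent(sizes):
    """Least-squares slope of log10(count) versus log10(size).

    `sizes` is a list with one entry per discovered structure (e.g. the
    number of nodes of each star). A slope of about -a suggests that
    count(size) is roughly proportional to size**(-a).
    """
    counts = Counter(sizes)
    if len(counts) < 2:
        raise ValueError("need at least two distinct sizes to fit a line")
    points = [(math.log10(size), math.log10(count)) for size, count in counts.items()]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


# Synthetic histogram (hypothetical): count = 4096 / size**2, i.e. exponent -2.
synthetic_histogram = {2: 1024, 4: 256, 8: 64, 16: 16, 32: 4}
synthetic_sizes = [size for size, count in synthetic_histogram.items() for _ in range(count)]
slope = fit_power_law_exponent(synthetic_sizes)
```

On real data the points scatter around the line, so the fitted slope is best read together with the plot itself.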
Graph Understanding

Are the ‘important’ structures found by VoG semantically meaningful? For sense-making, we analyze the discovered subgraphs in three non-anonymized real datasets: Wikipedia-Liancourt-Rocks, Wikipedia-Chocolate, and Enron.


Wikipedia-Liancourt-Rocks. Figures 3.1 and 3.11(a-b) illustrate the original and VoG-based visualization of the Liancourt-Rocks graph. The VoG-Top10 summary consists of 8 stars and 2 near-bipartite cores (see also Table 3.5). To visualize the graph we leveraged the structures to which VoG draws attention by using Cytoscape⁶: For Figure 3.11(a), we applied the spring-embedded layout to the Wikipedia graph, and then highlighted the centers of the 8 stars that VoG discovered by providing the list of their corresponding IDs. For Figure 3.11(b), we input the list of nodes belonging to one side of the most ‘important’ bipartite core that VoG discovered, selected and dragged the corresponding nodes to the top-left corner, and applied the circular layout to them. The second most important bipartite core is visualized in the same way in Figure 3.1(d).

The 8 star configurations correspond mainly to administrators, such as “Future_Perfect_at_sunrise”, who do many minor edits in various parts of the article and also revert vandalism. The most interesting structures VoG identifies are the near-bipartite cores, which reflect: (i) the conflict between the two parties about the territorial rights to the island (Japan vs. South Korea), and (ii) an “edit war” between vandals and administrators or loyal Wikipedia users.

⁶ http://www.cytoscape.org/

(a) VoG: The 8 most “important” stars (their centers denoted with red rectangles).
(b) VoG: The most “important” bipartite graph (node set A denoted by the circle of red points).
(c) Effectiveness of Greedy’nForget (in red). Encoding cost of VoG vs. number of structures in the model, M.

Figure 3.11: The VoG summary of the Liancourt-Rocks graph, and effectiveness of the Greedy’nForget heuristic. In (c), Greedy’nForget leads to better encoding costs and smaller summaries (here only 40 are chosen) than Plain (~120 structures).

Figure 3.12: Wikipedia-Liancourt-Rocks: VoG-GnF successfully drops uninteresting structures that explain edges already explained by structures in model M. Diminishing returns in the edges explained by the new structures that are added by VoG and VoG-GnF.

In Figure 3.11(c), the encoding cost of VoG is given as a function of the selected structures. The dotted blue line corresponds to the cost of the Plain encoding, where the structures are added sequentially to the model M, in decreasing order of quality (local encoding benefit). The solid red line maps to the cost of the Greedy’nForget heuristic. Given that the goal is to summarize the graph in the most succinct way, and at the same time achieve low encoding cost, Greedy’nForget is effective. Finally, in Figure 3.12 we consider the models M with increasing number of structures (in decreasing quality order) and show the number of edges that each one explains. Specifically, Figure 3.12(a) refers to the models that consist of ordered subsets of all the structures (~120) that VoG-Plain discovered and Figure 3.12(b) refers to models incorporating ordered subsets of the ~35 structures kept by VoG-Greedy’nForget. We note that the slope in Figure 3.12(a) is steep for the first ~35 structures, after which the curve rises at a low rate, which means that the new structures that are added to the model M explain few new edges (diminishing returns). On the other hand, the edges explained by the structures discovered by VoG-Greedy’nForget increase at a higher rate, which signifies that VoG-Greedy’nForget drops uninteresting structures that explain edges that are already explained by structures in model M.
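The diminishing-returns curve described above can be reproduced in a few lines: walk through the structures in their given order and count how many distinct edges the model explains after each addition. The sketch below is an illustration under stated assumptions: each structure is represented simply as a collection of undirected edges, the function names are hypothetical, and the toy structures are invented. It does not implement Greedy’nForget itself, which decides on encoding cost rather than on edge coverage.

```python
def cumulative_explained_edges(structures):
    """Number of distinct edges explained after adding each structure in order.

    `structures` is an ordered list; each item is an iterable of edges (u, v).
    Edges are treated as undirected, so (1, 2) and (2, 1) count once.
    """
    explained = set()
    curve = []
    for edges in structures:
        explained.update(frozenset(edge) for edge in edges)
        curve.append(len(explained))
    return curve


def marginal_gains(curve):
    """New edges contributed by each structure, given the cumulative curve."""
    if not curve:
        return []
    return [curve[0]] + [later - earlier for earlier, later in zip(curve, curve[1:])]


# Toy model (hypothetical), already sorted by decreasing quality.
toy_structures = [
    [(1, 2), (1, 3), (1, 4), (1, 5)],  # star centered at node 1
    [(2, 3), (2, 4), (3, 4)],          # clique on nodes 2, 3, 4
    [(2, 1), (3, 1)],                  # redundant: both edges already explained
    [(5, 6), (6, 7)],                  # chain 5-6-7
]
coverage_curve = cumulative_explained_edges(toy_structures)  # [4, 7, 7, 9]
gains = marginal_gains(coverage_curve)                       # [4, 3, 0, 2]
```

A structure with zero marginal gain, such as the third one here, is the kind of redundant candidate the text describes as explaining edges that are already explained.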

Wikipedia-Chocolate. The visualization of Wikipedia-Chocolate is similar to the visualization of Liancourt-Rocks and is given in Figure 3.13 for completeness. As shown in Table 3.5, the Top10 summary of Chocolate contains 9 stars and 1 near-bipartite core. The center of the highest-ranked star corresponds to “Chobot”, a Wikipedia bot that fixes interlanguage links, and thus touches several, possibly unrelated parts of a page. The hubs (centers) of other stars correspond to administrators, who do many minor edits, as well as other heavy contributors. The near-bipartite core captures the interactions between possible vandals and administrators (or Wikipedia contributors) who were reverting each other’s edits, resulting in temporary (semi-)protection of the webpage. Figure 3.13(c) illustrates the effectiveness of the Greedy’nForget heuristic when applied to the Chocolate article. The Greedy’nForget heuristic (red line) reduces the encoding cost by keeping about the 100 most important of the 250 identified structures (x axis). The blue line corresponds to the encoding cost (y axis) of greedily adding the identified structures in decreasing order of encoding benefit.

(a) VoG: The 9 most “important” stars (the hubs of the stars denoted by the cyan points).
(b) VoG: The most “important” bipartite graph (node set A denoted by the rectangle of cyan points).
(c) Effectiveness of Greedy’nForget (in red). Encoding cost of VoG vs. the number of structures in the model, M.

Figure 3.13: VoG: summarization of the structures of the Wikipedia-Chocolate graph. (a)-(b): The top-10 structures of the summary. (c): The Greedy’nForget heuristic (red line) reduces the encoding cost by keeping about the 100 most important of the 250 identified structures.


Enron. The Top10 summary for Enron has nine stars and one near-bipartite core. The centers of the most informative stars are mainly high-ranking officials (e.g., Kenneth Lay with two email accounts, Jeff Skilling, Tracey Kozadinos). As a note, Kenneth Lay was the long-time Enron CEO, while Jeff Skilling held several high-ranking positions in the company, including CEO and managing director of Enron Capital & Trade Resources. The big near-bipartite core in Figure 3.14 is loosely connected to the rest of the graph, and represents the email communication about an extramarital affair, which was broadcast to 235 recipients. The small bipartite graph depicted in the same spy plot captures the email activity of several employees about a skiing trip on New Year.

Figure 3.14: Enron: Adjacency matrix of the top near-bipartite core found by VoG, corresponding to email communication about an “affair”, as well as for a smaller near-bipartite core found by VoG representing email activity regarding a skiing trip.

3.4.3 Q3: Scalability of VoG

In Figure 3.15, we present the runtime of VoG with respect to the number of edges in the input graph. For this purpose, we induce subgraphs of the Notre Dame dataset (WWW-Barabasi), whose dimensions we give in Table 3.7. We ran the experiments on an Intel(R) Xeon(R) CPU 5160 at 3.00GHz, with 16GB memory. The structure identification is implemented in Matlab, while the selection process is implemented in Python.

Observation 5. All the steps of VoG are designed to be scalable. Figure 3.15 shows the complexity is O(m), i.e., VoG is near-linear on the number of edges of the input graph.

Figure 3.15: VoG is near-linear on the number of edges. Runtime, in seconds, of VoG (Plain) vs. number of edges in graph. For reference we show the linear and quadratic slopes.

Table 3.7: Scalability: Induced subgraphs of WWW-Barabasi.

Name                Nodes    Edges
WWW-Barabasi-50k    49780    50624
WWW-Barabasi-100k   99854   205432
WWW-Barabasi-200k  200155   810950
WWW-Barabasi-300k  325729  1090108


3.5  Discussion 

The experiments show that VoG successfully solves an important open problem in graph understanding: how to find a succinct summary for a large graph. In this section we discuss some of our design decisions, as well as the advantages and limitations of VoG.

Why does VoG use the chosen vocabulary structures consisting of stars, (near-) cliques, (near-) bipartite cores and chains, and not other structures? The reason for choosing these and not other structures is that these structures appear very often, in tens of real graphs (e.g., in patent citation networks, phonecall networks, in the Netflix recommendation system, etc.), while they also have semantic meaning, such as factions or popular entities. Moreover, these graph structures are well-known and conceptually simple, making the summaries that VoG discovers easily interpretable.

It is possible that a graph does not contain the vocabulary terms we have predefined, but has more complex structures. However, this does not mean that the summary that VoG would generate will be empty. A core property of MDL is that it identifies the model in a model class that best describes the graph regardless of whether the true model is in that class. Thus, VoG will return the model that gives the most succinct description of the input graph in the vocabulary terms at hand: our model class. In this case, VoG will give a cruder description of the graph structure than would be ideal; it will have to spend more bits than ideal, which means we are guarded against overfitting. For the theory behind MDL for model selection, see [ ].

Can VoG handle the case where a new structure (e.g., “loops”) proves to be frequent in real graphs? If a new structure is frequent in real graphs, or other structures are important for specific applications, VoG can easily be extended to handle new vocabulary terms. The key insight for the vocabulary term encodings in Section 3.2 is to encode the necessary information as succinctly as possible. In fact, since MDL lets us straightforwardly compare two or more model classes, it will immediately tell us whether a vocabulary set V1 is better than a vocabulary set V2: the one that gives the best compression cost for the graph wins.

On  the  other  hand,  if  we  already  know  that  certain  nodes  form  a  clique,  star,  etc.,  it  is  trivial  to 
adapt  VoG  to  use  these  structures  as  the  base  model  M  of  the  graph  (as  opposed  to  the  empty 
model).  When  describing  the  graph,  VoG  will  then  only  report  those  structures  that  best  describe 
the  remainder  of  the  graph. 

Why does VoG fix the vocabulary a priori instead of automatically determining the most appropriate vocabulary for a given graph? An alternative approach to fixing the vocabulary would be to automatically determine the ‘right’ vocabulary for a given graph by doing frequent graph mining. We did not pursue this approach because of scalability and interpretability. For a vocabulary term to be useful, it needs to be easily understood by the user. This also relates to why we define our own encoding and optimization algorithm, instead of using off-the-shelf general-purpose compressors based on Lempel-Ziv (such as gzip) or statistical compressors (such as PPMZ); these provide state-of-the-art compression by exploiting complex statistical properties of the data, making their models very complex to understand. Local-structure based summaries, on the other hand, are much more easily understood. Frequent patterns have been proven to be interpretable and powerful building blocks for data summarization [ SI , KS09, TV12]. While a powerful technique, spotting frequent subgraphs has the notoriously expensive subgraph isomorphism problem in the inner loop. This aside, published algorithms on discovering frequent subgraphs (e.g., [YHO ]) are not applicable here, since they expect the nodes to have labels (e.g., carbon atom, oxygen atom, etc.) whereas we focus on large unlabeled graphs.

Can VoG be extended such that it can take specific edge distributions into account and only report structures that stand out from such a distribution? In this work we aim to assume as little as necessary for the edge distribution, such that VoG is both parameter-free and non-parametric at its core. However, it is possible to use a specific edge distribution. As long as we can calculate the probability of an adjacency matrix, P(E), we can trivially define L(E) = -log P(E). Thus, for instance, if we were to consider a distribution with a higher clustering coefficient (that is, dense areas are more likely), the cost for having a dense area in E will be relatively low, and hence VoG will only report structures that stand out from this (assumed) background distribution. Recent work by Araujo et al. [ ] explores discovering communities that exhibit a different (hyperbolic, i.e., power-law degree distributed) connectivity than the background distribution. It will be interesting to extend VoG with hyperbolic distributions for both subgraphs, as well as for encoding the error matrix.
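To make the relation L(E) = -log P(E) concrete, the sketch below evaluates it for the simplest possible background distribution: every node pair is an edge independently with the same probability. This model, the function name, and the choice of base-2 logarithms (so that the cost is in bits) are assumptions made for illustration; the text does not commit to a particular distribution.

```python
import math


def code_length_bits(adjacency, edge_prob):
    """L(E) = -log2 P(E) for an undirected simple graph.

    Assumes each of the n*(n-1)/2 node pairs is an edge independently with
    probability `edge_prob` (a hypothetical background model). `adjacency`
    is a symmetric 0/1 matrix given as a list of lists.
    """
    if not 0.0 < edge_prob < 1.0:
        raise ValueError("edge_prob must lie strictly between 0 and 1")
    n = len(adjacency)
    bits = 0.0
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only
            pair_prob = edge_prob if adjacency[i][j] else 1.0 - edge_prob
            bits -= math.log2(pair_prob)
    return bits


triangle = [
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 0],
]
uninformed_cost = code_length_bits(triangle, 0.5)   # 3 pairs at 1 bit each
dense_prior_cost = code_length_bits(triangle, 0.9)  # dense areas expected, so cheaper
```

Under the dense background the triangle costs far less than three bits, which mirrors the argument above: a dense area is cheap when the assumed distribution already expects it, so it no longer stands out.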

Why use SlashBurn for the graph decomposition? To generate candidate subgraphs, we use SlashBurn because it is scalable, and designed to handle graphs without caveman structure. However, any graph decomposition method could be used instead, or, even better, a combination of decomposition methods could be applied. We conjecture that the more graph decomposition methods provide the candidate structures for VoG, the better the resulting summary will be. Essentially, there is no single correct graph partitioning technique, since each one of them works by optimizing a different goal. MDL, which is an indispensable component of VoG, will be able to discover the best structures among the set of candidate structures.

Can VoG give high-level insight into how the structures in a summary are connected? Although
VoG  does  not  explicitly  encode  the  linkage  between  the  discovered  structures,  it  can  give  high-level 
insight  into  how  the  structures  in  a  summary  are  connected.  From  the  design  point  of  view,  we 
allow  nodes  to  participate  in  multiple  structures,  and  such  nodes  implicitly  ‘link’  two  structures. 
For  example,  a  node  can  be  part  of  a  clique,  as  well  as  the  starting-point  of  a  chain,  therefore 
‘linking’  the  clique  and  the  chain.  The  linkage  structure  of  the  summary  can  hence  be  trivially 
extracted  by  inspecting  whether  the  node  sets  of  structures  in  the  summary  overlap.  It  may  depend 
on  the  task  whether  one  prefers  to  show  the  high  level  linkage  of  the  structure,  or  give  the  details 
per  structure  in  the  summary. 
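The overlap inspection mentioned above can be sketched as follows: for every pair of structures in a summary, intersect their node sets and record the shared nodes. The representation of a summary as a dictionary from a structure name to its nodes, and the function name, are hypothetical choices made for this illustration.

```python
from itertools import combinations


def structure_linkage(structures):
    """Map each pair of overlapping structures to the nodes they share.

    `structures` maps a structure name to an iterable of its node IDs.
    Pairs without any common node are omitted from the result.
    """
    links = {}
    for (name_a, nodes_a), (name_b, nodes_b) in combinations(structures.items(), 2):
        shared = set(nodes_a) & set(nodes_b)
        if shared:
            links[(name_a, name_b)] = shared
    return links


# Toy summary (hypothetical): node 4 belongs to the clique and starts the chain.
toy_summary = {
    "clique": {1, 2, 3, 4},
    "chain": [4, 5, 6, 7],
    "star": {10, 11, 12},
}
linkage = structure_linkage(toy_summary)  # {("clique", "chain"): {4}}
```

The resulting pairs can be drawn as a small graph over the structures, which gives the high-level linkage view, while the per-structure node lists give the detailed view.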


3.6  Related  Work 

Work  related  to  VoG  comprises  Minimum  Description  Length-based  approaches,  as  well  as  graph 
partitioning  and  visualization. 

3.6.1  MDL  and  Data  Mining 

Faloutsos and Megalooikonomou [FI ] argue that since many data mining problems are related to summarization and pattern discovery, they are intrinsically related to Kolmogorov complexity. Kolmogorov complexity [LV93] identifies the shortest lossless algorithmic description of a dataset, and provides sound theoretical foundations for both identifying the optimal model for a dataset, and defining what structure is. While not computable, it can be practically implemented by the Minimum Description Length principle [ s78, G ], i.e., lossless compression. Examples of applications in data mining include clustering [CV05], classification [vLVS06], discovering communities in matrices [ 1FO ], model order selection in matrix factorization [ IV1 , ], outlier detection [ , i], pattern set mining [V , ], finding sources of infection in large graphs [ 12], and making sense of selected nodes in graphs [ 13], to name a few.

We  are,  to  the  best  of  our  knowledge,  the  first  to  employ  MDL  with  the  goal  of  summarizing  a 
graph  with  a  vocabulary  beyond  simple  rectangles,  and  allowing  overlap  between  structures. 


3.6.2  Graph  Compression  and  Summarization 

Boldi and Vigna [BV04a] studied the compression of web graphs using the lexicographic localities; Chierichetti et al. [ ] extended it to social networks; Apostolico et al. [ ] used BFS for compression. Maserrat et al. [ 10] used multi-position linearizations for neighborhood queries. Feng et al. [ 13] encode edges per triangle using a lossy encoding. SlashBurn [ ] exploits the power-law behavior of real-world graphs, addressing the ‘no good cut’ problem [ M08]. Tian et al. [ )8] present an attribute-based graph summarization technique with non-overlapping and covering node groups; an automation of this method is given by Zhang et al. [ ]. Toivonen et al. [TZ ] use a node structural equivalence-based approach for compressing weighted graphs.

An alternative way of “compressing” a graph is by sampling nodes or edges from it [ 706, HKBG08]. The goal of sampling is to obtain a graph of smaller size, which maintains some properties of the initial graph, such as the degree distribution, the size distribution of connected components, the diameter, or latent properties including the community structure [ ] (i.e., the graph sample contains nodes from all the existing communities). Although graph sampling may allow better visualization [ 1C05], unlike VoG, it cannot detect graph structures and may need additional processing in order to make sense of the sample.

None of the above provide summaries in terms of connectivity structures over non-trivial sub-graphs. Also, we should stress that we view compression not as the goal, but as the means of identifying summaries that help us understand the graph in simple terms, i.e., to discover sets of informative structures that explain the graph well. Furthermore, our VoG is designed for large-scale block-based matrix vector multiplication where each square block is stored independently of the others for scalable processing in distributed platforms like MapReduce [DG04]. The above-mentioned works are not designed for this purpose: the information of the outgoing edges of a node is tightly interconnected to the outgoing edges of its predecessor or successor, making them inappropriate for square block-based distributed matrix vector multiplication.


3.6.3  Graph  Partitioning 

Assuming away the ‘no good cut’ issue, there are countless graph partitioning algorithms: Koopman and Siebes [ 08, KS09] summarize multi-relational data, or heavily attributed graphs. Their method assumes the adjacency matrix is already known, as it aims at describing the node attribute-values using tree-shaped patterns.

Subdue [CH94] is a famous frequent-subgraph based summarization scheme. It iteratively replaces the most frequent subgraph in a labeled graph by a meta-node, which allows it to discover small lossy descriptions of labeled graphs. In contrast, we consider unlabeled graphs. Moreover, as our encoding is lossless, by MDL, we can fairly compare radically different models and model classes. Navlakha et al. [NRS08] follow a similar approach to Cook and Holder [CH94], by iteratively grouping nodes that see high interconnectivity. Their method is hence confined to summarizing a graph in terms of non-overlapping cliques and bipartite cores. In comparison, the work of Miettinen and Vreeken [MV1 , MV14] is closer to ours, even though they discuss MDL for Boolean matrix factorization. For directed graphs, such factorizations are in fact summaries in terms of possibly overlapping full cliques. With VoG we go beyond a single-term vocabulary, and, importantly, we can detect and reward explicit structure within subgraphs.

Chakrabarti et al. [ ] proposed the cross-association method, which provides a hard clustering of the nodes into groups, effectively looking for near-cliques. Papadimitriou et al. [ ] extended this to hierarchies, again of hard clusters. Rosvall and Bergstrom [RBO ] propose information-theoretic approaches for community detection for hard-clustering the nodes of the graph. With VoG we allow nodes to participate in multiple structures, and can summarize graphs in terms of how subgraphs are connected, beyond identifying that they are densely connected.

An alternative approach to partition the nodes (actors) of a graph into groups, and capture the network relations between them, is provided by the blockmodels representation [FW9 ], which comes from psychometrics and sociology, and is often encountered in social network analysis. The idea of blockmodels is relevant to our approach, which summarizes a graph in terms of simple structures, and reveals the connections between them. Particularly the mixed-membership assumption that we make in this section is related to the stochastic blockmodels [ 08, ]. These probabilistic models combine global parameters that instantiate dense patches of connectivity (blockmodel) with local parameters that capture nodes belonging to multiple blockmodels. Unlike our method, blockmodels need the number of partitions as input, spot mainly dense patches (such as cliques and bipartite cores, without explicitly characterizing them) in the adjacency matrix, and make a set of statistical assumptions about the interaction patterns between the nodes (generative process). A variety of graph clustering methods, including blockmodels, could be used in the first step of VoG (Algorithm 3.1) in order to generate candidate subgraphs, which would then be ranked, and maybe included in the graph summary.


3.6.4  Graph  Visualization 

Apolo [ ] is a graph tool used for attention routing. The user picks a few seed nodes and Apolo interactively expands their vicinities, enabling sense-making this way. An anomaly detection system for large graphs, OPAvion [ ], mines graph features using Pegasus [ ], spots anomalies by employing OddBall [ F10], and lastly interactively visualizes the anomalous nodes via Apolo. In [ 08], Shneiderman proposes simply scaled density plots to visualize scatter plots, [BS04] presents random and density sampling techniques for datasets with several thousands of points, while NetRay [KLKF14] focuses on informative visualizations of the spy, distribution and correlation plots of web-scale graphs. Dunne and Shneiderman [DS13] introduce the idea of motif simplification to enhance network visualization. Some of these motifs are part of our vocabulary, but VoG also allows for near-structures, which are common in real-world graphs.

What sets VoG apart:

Unlike VoG, none of the above methods meets all the following specifications: (a) gives a soft clustering, (b) is scalable, (c) has a large vocabulary of graph primitives (beyond cliques/caveman-graphs) and (d) is parameter-free. Moreover, graph visualization techniques focus on anomalous nodes or how to visualize the patterns of the whole graph. In contrast, VoG does not attempt to find specific nodes, but informative structures that summarize the graph well, and therefore provides a small set of nodes and edges that are worth (by the MDL principle) visualizing.


Table  3.8:  Feature-based  comparison  of  VoG  with  alternative  approaches. 


[Table body: the compared approaches are Visualization methods [CKHF11, KLKF14, DS13], Graph partitioning [KK99, DGK05, AKY99], Community detection [SBGF14, NG04, BGLL08], and VoG [KKVF14, KKVF15]. VoG carries a check mark (✓) in all seven feature columns.]

3.7  Summary 

We  have  studied  the  problem  of  succinctly  describing  a  large  graph  in  terms  of  connectivity 
structures.  Our  contributions  are: 

•  Problem  Formulation:  We  proposed  an  information  theoretic  graph  summarization  technique 
that  uses  a  carefully  chosen  vocabulary  of  graph  primitives  (Section  3.2). 

•  Effective  and  Scalable  Algorithm:  We  gave  VoG,  an  effective  method  which  is  near-linear  on 
the  number  of  edges  of  the  input  graph  (Section  3.4.3). 

•  Experiments  on  Real  Graphs:  We  discussed  interesting  findings  like  exchanges  between 
Wikipedia  vandals  and  responsible  editors  on  large  graphs,  and  also  analyzed  VoG  quantitatively 
as  well  as  qualitatively  (Figure  3.1,  Section  3.4). 

Future work includes extending the VoG vocabulary to more complex graph structures that we know appear in real graphs, such as core-peripheries (bipartite cores in which one set also forms a clique) and so-called “jellyfish” (cliques of stars), as well as implementing VoG in a distributed computing framework such as MapReduce.

Chapter based on work that appeared at PKDD 2011 [KKK+ : ] and VLDB 2015 [GGKF1 ].


Chapter  4 

Inference  in  a  Graph:  Two  Classes 


In Chapter 3 we saw how we can summarize a large graph and gain insights into its important
and semantically meaningful structures. In this chapter we examine how we can use network
effects to learn about the nodes in the remaining network structures. In contrast to the previous
chapter, we assume that the nodes have a class label, such as ‘liberal’ or ‘conservative’.

Network  effects  are  very  powerful,  resulting  even  in  popular  proverbs  such  as  “birds  of  a 
feather  flock  together”  or  “opposites  attract”.  For  example,  in  social  networks,  we  often  observe 
homophily:  obese  people  tend  to  have  obese  friends  [CF07] ,  happy  people  tend  to  make  their 
friends  happy  too  [FC08],  and  in  general,  people  tend  to  associate  with  like-minded  friends,  with 
respect to politics, hobbies, religion, etc. Homophily is encountered in other settings too: if a user
likes some pages, she would probably like other pages that are heavily connected to her favorite
ones (personalized PageRank); if a user likes some products, he will probably like similar products
too (content-based recommendation systems); if a user is dishonest, his/her contacts are probably
dishonest too (accounting and calling-card fraud). Occasionally, the reverse, called heterophily,
is  true.  For  example,  in  an  online  dating  site,  we  may  observe  that  talkative  people  prefer  to 
date  silent  ones,  and  vice  versa.  Thus,  by  knowing  the  labels  of  a  few  nodes  in  a  network,  as 
well  as  whether  homophily  or  heterophily  applies  in  a  given  scenario,  we  can  usually  give  good 
predictions  about  the  labels  of  the  remaining  nodes. 

In  this  chapter  we  cover  the  cases  with  two  classes  (e.g.,  talkative  vs.  silent),  and  focus  on 
finding  the  most  likely  class  labels  for  all  the  nodes  in  the  graph.  In  Chapter  5,  we  extend  our 
work  to  cover  cases  with  more  than  two  classes  as  well.  Informally,  the  problem  is  defined  as: 

Problem  Definition  3.  [Guilt-by-association  -  Informal] 

• Given: a graph with n nodes and m edges; $n_+$ and $n_-$ nodes labeled as members of the
positive and negative class, respectively

•  Find:  the  class  memberships  of  the  rest  of  the  nodes,  assuming  that  neighbors  influence 
each  other. 

The  influence  can  be  homophily  or  heterophily.  This  learning  scenario,  where  we  reason  from 
observed  training  cases  directly  to  test  cases,  is  also  called  transductive  inference,  as  opposed  to 
inductive learning, where the training cases are used to infer general rules which are then applied
to  new  test  cases. 

There are several closely related methods that are used for transductive inference in networked
data. Most of them handle homophily, and some of them also handle heterophily. We focus on
three methods: Personalized PageRank (a.k.a. “Personalized Random Walk with Restarts”, or just
RWR) [ ]; Semi-Supervised Learning (SSL) [ ]; and Belief Propagation (BP) [ ].

How  are  these  methods  related?  Are  they  identical?  If  not,  which  method  gives  the  best  accuracy? 
Which  method  has  the  best  scalability? 

These  questions  are  the  focus  of  this  work.  In  a  nutshell,  we  answer  the  above  questions,  and 
contribute  a  fast  algorithm  inspired  by  our  theoretical  analysis: 

•  Theory  and  Correspondences:  The  three  methods  are  closely  related,  but  not  identical. 

•  Effective  and  Scalable  Algorithm:  We  propose  FaBP,  a  fast,  accurate  and  scalable  algo¬ 
rithm,  and  provide  the  conditions  under  which  it  converges. 

• Experiments on Real Graphs: We propose a Hadoop-based algorithm that scales to
billion-node graphs, and we report experiments on one of the largest graphs ever studied in
the open literature. FaBP achieves about 2x better runtime than BP.

This  chapter  is  organized  as  follows:  Section  4.1  provides  necessary  background  for  the  three 
guilt-by-association  methods  we  consider;  Section  4.2  shows  the  equivalence  of  the  methods 
and  introduces  our  proposed  method,  FaBP;  Section  4.3  provides  the  derivation  of  FaBP  and 
Section  4.4  presents  some  sufficient  conditions  for  our  method’s  convergence;  Section  4.5  gives 
the  proposed  FaBP  algorithm  followed  by  experiments  in  Section  4.6  and  a  summary  of  our  work 
in  Section  4.7. 


4.1 Background

We  provide  background  information  for  three  guilt-by-association  methods:  RWR,  SSL  and  BP. 

4.1.1  Random  Walk  with  Restarts  (RWR) 

RWR  is  the  method  underlying  Google’s  classic  PageRank  algorithm  [BP98],  which  is  used  to 
measure  the  relative  importance  of  webpages.  The  idea  behind  PageRank  is  that  there  is  an 
imaginary  surfer  who  is  randomly  clicking  on  links.  At  any  step,  the  surfer  clicks  on  a  link  with 
probability  c.  The  PageRank  of  a  page  is  defined  recursively  and  depends  on  the  PageRank  and 
the  number  of  the  webpages  pointing  to  it;  the  more  pages  with  high  importance  link  to  a  page, 
the  more  important  the  page  is.  The  PageRank  vector  r  is  defined  to  be  the  solution  to  the  linear 
system: 

$$\mathbf{r} = (1 - c)\,\mathbf{y} + c\,\mathbf{B}\,\mathbf{r},$$

where $1 - c$ is the restart probability and $c \in [0, 1]$. In the original algorithm, $\mathbf{B} = \mathbf{D}^{-1}\mathbf{A}$, which
corresponds to the column-normalized adjacency matrix of the graph, and the starting vector is
defined as $\mathbf{y} = \frac{1}{n}\mathbf{1}$ (uniform starting vector, where $\mathbf{1}$ is the all-ones column-vector).
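As a minimal sketch of the linear system above, the following Python snippet solves for r directly on a toy graph. The function name `rwr_scores`, the path graph, and the value c = 0.85 are illustrative assumptions, not part of the original text; the column-normalized matrix is built as $\mathbf{A}\mathbf{D}^{-1}$, the form that appears in Equation 4.5.

```python
import numpy as np

def rwr_scores(A, y, c=0.85):
    """Solve r = (1 - c) * y + c * B r, with B the column-normalized adjacency."""
    deg = A.sum(axis=0)        # column sums (equal to the degrees for symmetric A)
    B = A / deg                # broadcasting divides column j by deg[j]
    n = A.shape[0]
    return np.linalg.solve(np.eye(n) - c * B, (1 - c) * y)

# Toy undirected path graph 0 - 1 - 2 - 3 (hypothetical example data).
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
y = np.full(4, 1 / 4)          # uniform starting vector (1/n) * 1
r = rwr_scores(A, y)
```

A direct solve is only practical for small graphs; for large graphs one would typically iterate the recursive definition instead.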
A variation of PageRank is the lazy random walk [ VICO 7], which allows the surfer to remain at the
same page at any point. This option is encoded by the matrix $\mathbf{B} = \frac{1}{2}(\mathbf{I} + \mathbf{D}^{-1}\mathbf{A})$. The two methods
are equivalent up to a change in c [ACL06]. Another variation of PageRank includes works
where the starting vector is not uniform, but depends on the topic distribution [ W02, Hav03].
These vectors are called Personalized PageRank vectors and they provide personalized or
context-sensitive search ranking. Several works focus on speeding up RWR [ D04, FRCP , FPOC ].
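The "equivalent up to a change in c" statement can be illustrated numerically. Substituting the lazy matrix $\frac{1}{2}(\mathbf{I} + \mathbf{B})$ into the RWR system and rearranging suggests that a lazy walk with parameter c matches a standard walk with parameter c / (2 - c). The sketch below checks this on a small graph; the helper `solve_walk`, the graph, and the chosen values are assumptions made for illustration.

```python
import numpy as np

def solve_walk(B, y, c):
    """Solve r = (1 - c) * y + c * B r for a given transition matrix B."""
    n = B.shape[0]
    return np.linalg.solve(np.eye(n) - c * B, (1 - c) * y)

# Small undirected graph (hypothetical example data).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
B = A / A.sum(axis=0)                 # column-normalized adjacency
B_lazy = 0.5 * (np.eye(4) + B)        # lazy walk: stay put with probability 1/2

y = np.array([1.0, 0.0, 0.0, 0.0])    # personalized starting vector (seed: node 0)
c_lazy = 0.9
c_std = c_lazy / (2 - c_lazy)         # matching parameter for the standard walk

r_lazy = solve_walk(B_lazy, y, c_lazy)
r_std = solve_walk(B, y, c_std)
```

Under these assumptions the two score vectors coincide, so the lazy variant changes the effective restart probability rather than the ranking.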

Related methods for node-to-node distance (but not necessarily guilt-by-association) include
[KNV06], parameterized by escape probability and round-trip probability, SimRank [JWO: ] and
extensions/improvements [YLZ+L ], [ ]. Although RWR is primarily used for scoring
nodes relative to seed nodes, a formulation where RWR classifies nodes in a semi-supervised
setting has been introduced by Lin and Cohen [ ,CH ]. The connection between random walks
and electric network theory is described in Doyle and Snell’s book [DS84].

4.1.2  Semi-supervised  learning  (SSL) 

SSL approaches are divided into four categories [ ]: (i) low-density separation methods, (ii)
graph-based methods, (iii) methods for changing the representation, and (iv) co-training methods. A
survey of various SSL approaches is given in [Zhu06], and an application of transductive SSL for
multi-label classification in heterogeneous information networks is described in [ ]. SSL
uses both labeled and unlabeled data for training, as opposed to supervised learning that uses only
labeled data, and unsupervised learning that uses only unlabeled data. The principle behind SSL
is that unlabeled data can help us decide the “metric” between data points and improve models’
performance. SSL can be either transductive (it works only on the labeled and unlabeled training
data) or inductive (it can be used to classify unseen data).

Most  graph-based  SSL  methods  are  transductive,  non-parametric  and  discriminative.  The 
graphs  used  in  SSL  consist  of  labeled  and  unlabeled  nodes  (examples),  and  the  edges  represent 
the  similarity  between  them.  SSL  can  be  expressed  in  a  regularization  framework,  where  the  goal 
is  to  estimate  a  function  f  on  the  graph  with  two  parts: 

• Loss function, which forces the final labels for the labeled examples to be close to their
initial labels.

• Regularization, which expresses that f is smooth on the whole graph (equivalently, similar
nodes are connected: “homophily”).

Here we refer to a variant of graph mincut introduced in [BC01]. Mincut is the mode of a
Markov random field with binary labels (Boltzmann machine). Given l labeled points $(x_i, y_i)$,
$i = 1, \ldots, l$, and u unlabeled points $x_{l+1}, \ldots, x_{l+u}$, the final labels x are found by minimizing the
energy function:

$$E(\mathbf{x}) = \alpha \sum_{j \in N(i)} a_{ij}(x_i - x_j)^2 + \sum_{1 \le i \le l} (y_i - x_i)^2, \qquad (4.1)$$

where $\alpha$ is related to the coupling strength (homophily) of neighboring nodes, N(i) denotes the
neighbors of i, and $a_{ij}$ is the $(i, j)$th element of the adjacency matrix A.
Figure 4.1: Propagation or coupling matrices capturing the network effects. Panels: (a) Explanation
of entries, (b) Homophily, (c) Heterophily. (a): Explanation of the entries in the propagation matrix,
e.g., $P[x_i = + \mid x_j = -]$ and $P[x_i = - \mid x_j = -]$. P stands for probability; $x_i$ and $x_j$ represent
the state/class/label of node i and j, respectively. Color intensity corresponds to the coupling
strengths between classes of neighboring nodes. (b)-(c): Examples of propagation matrices capturing
different network effects. (b): D: Democrats, R: Republicans. (c): T: Talkative, S: Silent.


4.1.3  Belief  Propagation  (BP) 

Belief Propagation (BP), also called the sum-product algorithm, is an exact inference method
for graphical models with a tree structure [Pea88]. In a nutshell, BP is an iterative message-passing
algorithm that computes the marginal probability distribution for each unobserved node
conditional on observed nodes: all nodes receive messages from their neighbors in parallel, update
their belief states, and finally send new messages back out to their neighbors. In other words, at
iteration t of the algorithm, the posterior belief of a node i is conditioned on the evidence of its
neighbors up to t steps away in the underlying network. This process repeats until convergence and is
well-understood on trees.

When applied to loopy graphs, however, BP is not guaranteed to converge to the marginal
probability distribution, nor to converge at all. In these cases it can still be used as an approximation
scheme [Pea88]. Despite the lack of exact convergence criteria, “loopy BP” has been shown
to give accurate results in practice [YFW03], and it is thus widely used today in various applications,
such as error-correcting codes [ ], stereo imaging in computer vision [ ], fraud detection
[ IBA 1 09, PCWF ], and malware detection [CNW+ ]. Extensions of BP include Generalized
Belief Propagation (GBP) that takes a multi-resolution viewpoint, grouping nodes into regions
[YFW05]; however, how to construct good regions is still an open research problem. Thus, we
focus on standard BP, which is better understood.

We  are  interested  in  BP  because  not  only  is  it  an  efficient  inference  algorithm  on  probabilistic 
graphical  models,  but  it  has  also  been  successfully  used  for  transductive  inference.  Our  goal  is 
to  find  the  most  likely  beliefs  (or  classes)  for  all  the  nodes  in  a  network.  BP  helps  to  iteratively 
propagate  the  information  from  a  few  nodes  with  initial  (or  explicit)  beliefs  throughout  the 
network. More formally, consider a graph of n nodes and k = 2 possible class labels. The original
update formulas [YFWO ] for the messages sent from node i to node j and the belief of node i for
being in state $x_i$ are

$$m_{ij}(x_j) \leftarrow \sum_{x_i} \phi_i(x_i) \cdot \psi_{ij}(x_i, x_j) \cdot \prod_{n \in N(i) \setminus j} m_{ni}(x_i) \qquad (4.2)$$

$$b_i(x_i) \leftarrow \eta \cdot \phi_i(x_i) \cdot \prod_{j \in N(i)} m_{ji}(x_i) \qquad (4.3)$$

where the message from node i to node j is computed based on all the messages sent by all its
neighbors in the previous step except for the previous message sent from node j to node i. N(i)
denotes the neighbors of i, $\eta$ is a normalization constant that guarantees that the beliefs sum to
one, $\sum_{x_i} b_i(x_i) = 1$, $m_{ji}$ is the message sent from node j to node i, and $\psi_{ij}$ represents the edge
potential when node i is in state $x_i$ and node j is in state $x_j$. We will often refer to $\psi_{ij}$ as homophily
or coupling strength between nodes i and j. We can organize the edge potentials in a matrix form,
which we call propagation or coupling matrix. We note that, by definition (Figure 4.1(a)), the
propagation matrix is a right stochastic matrix (i.e., its rows add up to 1). Figure 4.1 illustrates two
example propagation matrices that capture homophily and heterophily between two states. For
BP, the above update formulas are repeatedly computed for each node until the values (hopefully)
converge to the final beliefs.
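To make the update formulas concrete, here is a minimal sketch of synchronous BP for two classes, compared against brute-force marginals on a small tree, where BP is exact. The function names, the toy graph, the priors, and the propagation matrix are illustrative assumptions; state index 0 stands for "+" and index 1 for "-", and messages are rescaled after each update purely for numerical convenience.

```python
import itertools
import numpy as np

def belief_propagation(edges, phi, psi, n_iters=10):
    """Synchronous sum-product BP for binary states (Equations 4.2 and 4.3)."""
    n = phi.shape[0]
    nbrs = {i: set() for i in range(n)}
    for i, j in edges:
        nbrs[i].add(j)
        nbrs[j].add(i)
    # m[(i, j)][x_j]: message from node i to node j, initialized uniformly.
    m = {(i, j): np.ones(2) for i in range(n) for j in nbrs[i]}
    for _ in range(n_iters):
        new_m = {}
        for (i, j) in m:
            # Product of the messages into i from all neighbors except j.
            incoming = np.ones(2)
            for k in nbrs[i] - {j}:
                incoming *= m[(k, i)]
            msg = psi.T @ (phi[i] * incoming)   # sum over the states of node i
            new_m[(i, j)] = msg / msg.sum()     # rescale; beliefs are unaffected
        m = new_m
    beliefs = np.empty((n, 2))
    for i in range(n):
        b = phi[i].copy()
        for j in nbrs[i]:
            b *= m[(j, i)]
        beliefs[i] = b / b.sum()                # normalization so beliefs sum to 1
    return beliefs

def brute_force_marginals(edges, phi, psi):
    """Exact marginals by enumerating all 2^n joint states (tiny graphs only)."""
    n = phi.shape[0]
    marg = np.zeros((n, 2))
    for states in itertools.product([0, 1], repeat=n):
        p = np.prod([phi[i, s] for i, s in enumerate(states)])
        for i, j in edges:
            p *= psi[states[i], states[j]]
        for i, s in enumerate(states):
            marg[i, s] += p
    return marg / marg.sum(axis=1, keepdims=True)

# Star-shaped tree with center 1 (hypothetical example data).
edges = [(0, 1), (1, 2), (1, 3)]
phi = np.array([[0.9, 0.1], [0.5, 0.5], [0.5, 0.5], [0.2, 0.8]])  # prior beliefs
psi = np.array([[0.8, 0.2], [0.2, 0.8]])                          # homophily
bp = belief_propagation(edges, phi, psi)
exact = brute_force_marginals(edges, phi, psi)
```

On this tree the BP beliefs agree with the enumerated marginals; on loopy graphs the same code runs, but the agreement is no longer guaranteed.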

In this chapter, we study how the parameter choices for BP help accelerate the algorithm, and
how to implement the method on top of Hadoop [ ] (open-source MapReduce implementation).
This focus differentiates our work from existing research which speeds up BP by exploiting the
graph structure [CG10, PCWF07] or the order of message propagation [GLG09].


4.1.4  Summary 

RWR, SSL, and BP have been used successfully in many tasks, such as ranking [ ],
classification [ i06, JSE ], malware and fraud detection [ , BA+ 09], and recommendation
systems [ ]. None of the above works, however, show the relationships between the three
methods, or discuss their parameter choices (e.g., homophily scores). Table 4.1 qualitatively
compares the three guilt-by-association methods and our proposed algorithm FaBP: (i) all methods
are scalable and support homophily; (ii) BP supports heterophily, but there is no guarantee on
convergence; (iii) our FaBP algorithm improves on BP by coming with sufficient conditions for
convergence (Section 4.4). In the following discussion, we use the symbols that are defined in Table 4.2.


Table 4.1: Qualitative comparison of ‘guilt-by-association’ (GbA) methods.

GbA Method    Heterophily    Scalability    Convergence
RWR           ✗              ✓              ✓
SSL           ✗              ✓              ✓
BP            ✓              ✓              ?
FaBP          ✓              ✓              ✓
Table 4.2: FaBP: Major symbols and definitions (matrices in bold capital font; vectors in bold
lower-case; scalars in plain font).

Symbols               Descriptions
n                     number of nodes in the graph
A                     n x n symmetric adjacency matrix
D                     n x n diagonal matrix of degrees, $D_{ii} = \sum_j A_{ij}$ and $D_{ij} = 0$ for $i \neq j$
I                     n x n identity matrix
$m_{ij}$, m(i, j)     message sent from node i to node j
$m_h(i, j)$           = m(i, j) - 0.5, “about-half” message sent from node i to node j
$\phi$                n x 1 vector of the BP prior beliefs, where $\phi(i)\{> 0.5, < 0.5\}$ means i ∈ {“+”, “-”} class initially, and $\phi(i) = 0$ means i’s class is unknown
$\phi_h$              = $\phi$ - 0.5, n x 1 “about-half” prior belief vector
b                     n x 1 BP final belief vector, where $b(i)\{> 0.5, < 0.5\}$ means i ∈ {“+”, “-”} class, and b(i) = 0 means i is unclassified (neutral)
$b_h$                 = b - 0.5, n x 1 “about-half” final belief vector
$h_h$                 = h - 0.5, “about-half” homophily factor, where h = $\psi$(“+”, “+”) is an entry of the BP propagation matrix; $h \to 0$ means strong heterophily and $h \to 1$ means strong homophily


4.2  Proposed  Method,  Theorems  and  Correspondences 

In  this  section  we  present  the  three  main  formulas  that  show  the  similarity  of  the  following  methods: 
binary  BP,  and  specifically  our  proposed  approximation,  linearized  BP  (FaBP);  Personalized 
Random  Walk  with  Restarts  (RWR);  BP  on  Gaussian  Random  Fields  (GaBP);  and  Semi-Supervised 
Learning  (SSL). 

For  the  homophily  case,  all  the  above  methods  are  similar  in  spirit,  and  closely  related  to 
diffusion  processes.  Assuming  that  the  positive  class  corresponds  to  green  color  and  the  negative 
class  to  red  color,  the  n+  nodes  that  belong  to  the  positive  class  act  as  if  they  taint  their  neighbors 
with  green  color,  and  similarly  do  the  negative  nodes  with  red  color.  Depending  on  the  strength 
of  homophily,  or  equivalently  the  speed  of  diffusion  of  the  color,  eventually  we  have  green-ish 
neighborhoods,  red-ish  neighborhoods,  and  bridge  nodes  (half-red,  half-green). 

As  we  show  next,  the  solution  vectors  for  each  method  obey  very  similar  equations:  they  all 
involve  a  matrix  inversion,  where  the  matrix  consists  of  a  diagonal  matrix  plus  a  weighted  or 
normalized  version  of  the  adjacency  matrix.  Table  4.3  shows  the  resulting  equations,  carefully 
aligned  to  highlight  the  correspondences. 

Next we give the equivalence results for all three methods, and the convergence analysis for
FaBP. The convergence of Gaussian BP, a variant of Belief Propagation, is studied in [MJW06]
and [Wei00]. The reasons that we focus on BP are that (a) it has a solid, Bayesian foundation, and
(b) it is more general than the rest, being able to handle heterophily (as well as multiple classes,
on which we elaborate in Chapter 5).
Table 4.3: Main results, to illustrate correspondence. Matrices (in capital and bold) are n x n;
vectors (lower-case bold) are n x 1 column vectors, and scalars (in lower-case plain font) typically
correspond to strength of influence. Detailed definitions: in the text.

Method        matrix                                                    unknown              known
RWR           $[\mathbf{I} - c\,\mathbf{A}\mathbf{D}^{-1}]$ ×           $\mathbf{x}$         = $(1 - c)\,\mathbf{y}$
SSL           $[\mathbf{I} + \alpha(\mathbf{D} - \mathbf{A})]$ ×        $\mathbf{x}$         = $\mathbf{y}$
GaBP = SSL    $[\mathbf{I} + \alpha(\mathbf{D} - \mathbf{A})]$ ×        $\mathbf{x}$         = $\mathbf{y}$
FaBP          $[\mathbf{I} + a\mathbf{D} - c'\mathbf{A}]$ ×             $\mathbf{b}_h$       = $\boldsymbol{\phi}_h$

Theorem 1 (FaBP).

The solution to belief propagation can be approximated by the linear system

$$[\mathbf{I} + a\mathbf{D} - c'\mathbf{A}]\,\mathbf{b}_h = \boldsymbol{\phi}_h, \qquad (4.4)$$

where $a = 4h_h^2/(1 - 4h_h^2)$ and $c' = 2h_h/(1 - 4h_h^2)$; $h_h$ is the “about-half” homophily
factor which represents the notion of the propagation matrix; $\boldsymbol{\phi}_h$ is the vector of prior
node beliefs, where $\phi_h(i) = 0$ for the nodes with no explicit initial information; $\mathbf{b}_h$ is
the vector of final node beliefs.


Proof.  The  derivation  of  the  FaBP  equation  is  given  in  Section  4.3. 
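As a minimal sketch of Theorem 1, the linear system in Equation 4.4 can be solved directly for a small graph. The function name `fabp`, the path graph, and the values of $h_h$ and the priors are illustrative assumptions; a dense solve is used here only because the example is tiny.

```python
import numpy as np

def fabp(A, phi_h, h_h):
    """Solve [I + a D - c' A] b_h = phi_h (Equation 4.4) with a dense solver."""
    a = 4 * h_h**2 / (1 - 4 * h_h**2)
    c_prime = 2 * h_h / (1 - 4 * h_h**2)
    D = np.diag(A.sum(axis=1))          # diagonal matrix of node degrees
    n = A.shape[0]
    return np.linalg.solve(np.eye(n) + a * D - c_prime * A, phi_h)

# Path graph 0 - 1 - 2 - 3 - 4 (hypothetical example data).
A = np.diag(np.ones(4), 1)
A = A + A.T
phi_h = np.array([0.01, 0.0, 0.0, 0.0, 0.0])   # only node 0 has a "+" prior

b_homophily = fabp(A, phi_h, h_h=0.1)      # h_h > 0: neighbors tend to agree
b_heterophily = fabp(A, phi_h, h_h=-0.1)   # h_h < 0: neighbors tend to disagree
```

With the homophily setting, every node leans positive and the effect fades with distance from node 0; with the heterophily setting, the sign of the final belief alternates along the path.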


Lemma 1 (Personalized RWR).

The linear system for RWR given an observation y is described by the following formula:

$$[\mathbf{I} - c\,\mathbf{A}\mathbf{D}^{-1}]\,\mathbf{x} = (1 - c)\,\mathbf{y}, \qquad (4.5)$$

where y is the starting vector and $1 - c$ is the restart probability, $c \in [0, 1]$.


Proof. See [Hav03], [TFP06].


The starting vector y corresponds to the prior beliefs for each node in BP, with the small
difference that $y_i = 0$ means that we know nothing about node i, while a positive score $y_i > 0$
means that the node belongs to the positive class (with the corresponding strength). In
Section  7.1.2,  we  elaborate  on  the  equivalence  of  RWR  and  FaBP  (Lemma  1,  p.  119  and  Theorem  1, 
p.  119).  A  connection  between  personalized  PageRank  and  inference  in  tree-structured  Markov 
random  fields  is  drawn  by  Cohen  [  0]. 
Lemma 2 (SSL).

Suppose we are given l labeled nodes $(x_i, y_i)$, $i = 1, \ldots, l$, $y_i \in \{0, 1\}$, and u unlabeled
nodes $(x_{l+1}, \ldots, x_{l+u})$. The solution to an SSL problem is given by the linear system:

$$[\alpha(\mathbf{D} - \mathbf{A}) + \mathbf{I}]\,\mathbf{x} = \mathbf{y} \qquad (4.6)$$

where $\alpha$ is related to the coupling strength (homophily) of neighboring nodes, y is the
label vector for the labeled nodes and x is the vector of final labels.


Proof. Given l labeled points $(x_i, y_i)$, $i = 1, \ldots, l$, and u unlabeled points $x_{l+1}, \ldots, x_{l+u}$ for a
semi-supervised learning problem, based on an energy minimization formulation, we solve for the
labels $x_i$ by minimizing the following functional E:

$$E(\mathbf{x}) = \alpha \sum_{j \in N(i)} a_{ij}(x_i - x_j)^2 + \sum_{1 \le i \le l} (y_i - x_i)^2, \qquad (4.7)$$

where $\alpha$ is related to the coupling strength (homophily) of neighboring nodes, and N(i) denotes the
neighbors of i. If all points are labeled, in matrix form, the functional can be re-written as

$$\begin{aligned}
E(\mathbf{x}) &= \mathbf{x}^T[\mathbf{I} + \alpha(\mathbf{D} - \mathbf{A})]\mathbf{x} - 2\,\mathbf{x} \cdot \mathbf{y} + K(\mathbf{y}) \\
&= (\mathbf{x} - \mathbf{x}^*)^T[\mathbf{I} + \alpha(\mathbf{D} - \mathbf{A})](\mathbf{x} - \mathbf{x}^*) + K'(\mathbf{y}),
\end{aligned}$$

where $\mathbf{x}^* = [\mathbf{I} + \alpha(\mathbf{D} - \mathbf{A})]^{-1}\mathbf{y}$, and K, K' are some constant terms which depend only on y. Clearly,
E achieves the minimum when

$$\mathbf{x} = \mathbf{x}^* = [\mathbf{I} + \alpha(\mathbf{D} - \mathbf{A})]^{-1}\mathbf{y}.$$

SSL is explained in Section 4.1.2 and in more depth in [Zhu06]. The connection of SSL and BP
on Gaussian Random Fields (GaBP) can be found in [ZGL03, Zhu06]. ■
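The minimizer in the proof of Lemma 2 can be checked numerically. The sketch below assumes the matrix form of the energy, $E(\mathbf{x}) = \mathbf{x}^T[\mathbf{I} + \alpha(\mathbf{D} - \mathbf{A})]\mathbf{x} - 2\,\mathbf{x} \cdot \mathbf{y} + \mathbf{y} \cdot \mathbf{y}$, with every node carrying an entry in y (zero for unlabeled nodes); the function names, the graph, and the value of α are hypothetical choices for illustration.

```python
import numpy as np

def ssl_energy(x, A, y, alpha):
    """Matrix form of the energy: alpha * x^T (D - A) x + ||y - x||^2."""
    L = np.diag(A.sum(axis=1)) - A      # graph Laplacian D - A
    return alpha * x @ L @ x + np.sum((y - x) ** 2)

def ssl_solve(A, y, alpha):
    """Closed-form minimizer x* = [I + alpha (D - A)]^{-1} y (Equation 4.6)."""
    n = A.shape[0]
    L = np.diag(A.sum(axis=1)) - A
    return np.linalg.solve(np.eye(n) + alpha * L, y)

# Small undirected graph with one labeled node at each end (hypothetical data).
A = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 1, 0],
              [0, 1, 0, 1, 0],
              [0, 1, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
y = np.array([1.0, 0.0, 0.0, 0.0, -1.0])
alpha = 0.5

x_star = ssl_solve(A, y, alpha)
e_star = ssl_energy(x_star, A, y, alpha)

# Random perturbations of x* should never lower the energy.
rng = np.random.default_rng(0)
perturbed = [ssl_energy(x_star + 0.1 * rng.standard_normal(5), A, y, alpha)
             for _ in range(100)]
```

Because I + α(D - A) is positive definite for α ≥ 0, the quadratic has a unique minimum, which is why every perturbed energy in the sketch comes out larger than `e_star`.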


As  before,  y  represents  the  labels  of  the  labeled  nodes  and,  thus,  it  is  related  to  the  prior  beliefs 
in  BP;  x  corresponds  to  the  labels  of  all  the  nodes  or,  equivalently,  the  final  beliefs  in  BP. 


Lemma 3 (R-S correspondence).

On a regular graph (i.e., all nodes have the same degree d), RWR and SSL can produce
identical results if

$$\alpha = \frac{c}{(1 - c)\,d} \qquad (4.8)$$

That is, we need to carefully align the homophily strengths $\alpha$ and c.
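Proof. Based on Equations 4.5 and 4.6, the two methods will give identical results if

$$\begin{aligned}
(1 - c)\,[\mathbf{I} - c\,\mathbf{D}^{-1}\mathbf{A}]^{-1} &= [\mathbf{I} + \alpha(\mathbf{D} - \mathbf{A})]^{-1} &\Leftrightarrow \\
\left(\frac{1}{1 - c}\,\mathbf{I} - \frac{c}{1 - c}\,\mathbf{D}^{-1}\mathbf{A}\right)^{-1} &= (\alpha(\mathbf{D} - \mathbf{A}) + \mathbf{I})^{-1} &\Leftrightarrow \\
\frac{1}{1 - c}\,\mathbf{I} - \frac{c}{1 - c}\,\mathbf{D}^{-1}\mathbf{A} &= \mathbf{I} + \alpha(\mathbf{D} - \mathbf{A}) &\Leftrightarrow \\
\frac{c}{1 - c}\,\mathbf{I} - \frac{c}{1 - c}\,\mathbf{D}^{-1}\mathbf{A} &= \alpha(\mathbf{D} - \mathbf{A}) &\Leftrightarrow \\
\frac{c}{1 - c}\,[\mathbf{I} - \mathbf{D}^{-1}\mathbf{A}] &= \alpha(\mathbf{D} - \mathbf{A}) &\Leftrightarrow \\
\frac{c}{1 - c}\,\mathbf{D}^{-1}[\mathbf{D} - \mathbf{A}] &= \alpha(\mathbf{D} - \mathbf{A}) &\Leftrightarrow \\
\frac{c}{1 - c}\,\mathbf{D}^{-1} &= \alpha\,\mathbf{I}.
\end{aligned}$$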


If the graph is “regular”, $d_i = d$ ($i = 1, \ldots$), or $\mathbf{D} = d \cdot \mathbf{I}$, in which case the condition becomes

$$\alpha = \frac{c}{(1 - c)\,d} \;\Rightarrow\; c = \frac{\alpha d}{1 + \alpha d} \qquad (4.9)$$

where d is the common degree of all the nodes.
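The correspondence of Lemma 3 can be verified numerically on a regular graph. The sketch below uses a cycle (every node has degree 2) and an arbitrary value of c; the graph, the seed labels, and the variable names are assumptions made for illustration.

```python
import numpy as np

n, d = 6, 2                          # cycle graph: every node has degree d = 2
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = 1.0
    A[(i + 1) % n, i] = 1.0
D = np.diag(A.sum(axis=1))

c = 0.7                              # RWR parameter (restart probability 1 - c)
alpha = c / ((1 - c) * d)            # matching SSL coupling strength (Equation 4.8)

y = np.array([1.0, 0.0, 0.0, -1.0, 0.0, 0.0])   # one "+" seed and one "-" seed

# RWR: [I - c A D^{-1}] x = (1 - c) y   (Equation 4.5)
x_rwr = np.linalg.solve(np.eye(n) - c * A @ np.linalg.inv(D), (1 - c) * y)
# SSL: [I + alpha (D - A)] x = y        (Equation 4.6)
x_ssl = np.linalg.solve(np.eye(n) + alpha * (D - A), y)

c_back = alpha * d / (1 + alpha * d)  # inverse mapping (Equation 4.9)
```

On this regular graph the two solution vectors coincide, and `c_back` recovers the original c.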


Although Lemma 3 refers to regular graphs, the result can be extended to arbitrary graphs as well.
In this case, instead of having a single homophily strength $\alpha$, we introduce a homophily factor per
node i, $\alpha_i = \frac{c}{(1 - c)\,d_i}$, with degree $d_i$.

Arithmetic  Examples 

In this section we illustrate that SSL and RWR give closely related solutions. We set $\alpha$ to be
$\alpha = \frac{c}{(1 - c)\,\bar{d}}$ (where $\bar{d}$ is the average degree).

Figure 4.2 shows the scatter plot: each red star $(x_i, y_i)$ corresponds to a node, say, node i;
the coordinates are the RWR and SSL scores, respectively. The blue circles correspond to the
perfect identity, and thus are on the 45-degree line. The left plot in Figure 4.2 has three major
groups, corresponding to the ‘+’-labeled, the unlabeled, and the ‘-’-labeled nodes (from top-right
to bottom-left, respectively). The right plot in Figure 4.2 shows a magnification of the central part
(the unlabeled nodes). Notice that the red stars are close to the 45-degree line. The conclusion is
that (a) the SSL and RWR scores are similar, and (b) the rankings are the same: whichever node is
labeled as “positive” by SSL gets a high score by RWR, and conversely.
Panels: (a) RWR-SSL scatter plot; (b) RWR-SSL scatter plot (zoom-in).


Figure  4.2:  Illustration  of  near-equivalence  of  SSL  and  RWR.  We  show  the  SSL  scores  vs.  the  RWR 
scores  for  the  nodes  of  a  random  graph;  blue  circles  (ideal,  perfect  equality)  and  red  stars  (real). 
Right:  a  zoom-in  of  the  left.  Most  red  stars  are  on  or  close  to  the  diagonal:  the  two  methods  give 
similar  scores,  and  identical  assignments  to  positive/negative  classes. 


4.3  Derivation  of  FaBP 

In this section we derive FaBP, our closed formula for belief propagation (Theorem 1). The key
ideas are to center the values around 1/2, allow only small deviations, and use, for each variable,
the odds of the positive class.

The propagation or coupling matrix (see Section 4.1.3 and Figure 4.1) is central to BP as it
captures the network effects, i.e., the edge potential (or strength) between states/classes/labels.
Often the propagation matrix is symmetric, that is, the edge potential does not depend on
the direction in which a message is transmitted (e.g., Figure 4.1(c)). Moreover, because the
propagation matrix is also left stochastic, we can entirely describe it with a single value, such
as the first element $P[x_i = + \mid x_j = +] = \psi(\text{“+”}, \text{“+”})$. We denote this value with h and call
$h_h = h - \frac{1}{2}$ the “about-half” homophily factor (Table 4.2).

In  more  detail,  the  key  ideas  for  the  following  proofs  are: 

1. We center the values of all the quantities involved in BP around 1/2. Specifically, we use the
vectors $\mathbf{m}_h = \mathbf{m} - \frac{1}{2}$, $\mathbf{b}_h = \mathbf{b} - \frac{1}{2}$, $\boldsymbol{\phi}_h = \boldsymbol{\phi} - \frac{1}{2}$, and the scalar $h_h = h - \frac{1}{2}$. We allow the
values to deviate from the 1/2 point by a small constant $\epsilon$, use the Maclaurin expansions in
Table 4.4, and keep only the first order terms. By doing so, we avoid the sigmoid/non-linear
equations of BP.

2. For each quantity p, we use the odds of the positive class, $p_r = p/(1 - p)$, instead of
probabilities. This results in only one value for node i, $p_r(i) = p_+(i)/p_-(i)$, instead of two.
Moreover, the normalization factor is no longer needed.

We start from the original BP equations given in Section 4.1.3, and apply our two main ideas
to obtain the odds expressions for the BP message and belief equations. In the following equations,
we use the notation var(i) to denote that the variable refers to node i. We note that in the original
BP equations (Equations 4.2 and 4.3), we had to write $\mathrm{var}_i(x_i)$ to denote that the var refers to
node i and the state $x_i$. However, by introducing the odds of the positive class, we no longer need
to note the state/class of node i.

Table 4.4: Logarithm and division approximations used in the derivation of FaBP.

Formula      Maclaurin series                                                                        Approx.
Logarithm    $\ln(1 + \epsilon) = \epsilon - \frac{\epsilon^2}{2} + \frac{\epsilon^3}{3} - \ldots$    $\approx \epsilon$
Division     $\frac{1}{1 - \epsilon} = 1 + \epsilon + \epsilon^2 + \epsilon^3 + \ldots$              $\approx 1 + \epsilon$

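Lemma 4.

Expressed as ratios, the BP equations for the message and beliefs become:

$$m_r(i, j) \leftarrow B[h_r, b_{r,\mathrm{adjusted}}(i, j)] \qquad (4.10)$$

$$b_r(i) \leftarrow \phi_r(i) \cdot \prod_{j \in N(i)} m_r(j, i) \qquad (4.11)$$

where $b_{r,\mathrm{adjusted}}(i, j) = b_r(i)/m_r(j, i)$, and $B(x, y) = \frac{x \cdot y + 1}{x + y}$ is a blending function.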


Proof. Based on the notations introduced in our second key idea, $b_+(i) = b_i(x_i = +)$ in
Equation 4.3. By writing out the definition of $b_r$ and using Equation 4.3 in the numerator and
denominator, we obtain

$$b_r(i) = \frac{b_+(i)}{b_-(i)} \overset{\text{Equation 4.3}}{=} \frac{\eta \cdot \phi_+(i) \cdot \prod_{j \in N(i)} m_+(j, i)}{\eta \cdot \phi_-(i) \cdot \prod_{j \in N(i)} m_-(j, i)} = \phi_r(i) \prod_{j \in N(i)} \frac{m_+(j, i)}{m_-(j, i)} = \phi_r(i) \prod_{j \in N(i)} m_r(j, i).$$

This concludes the proof for Equation 4.11. We can write Equation 4.11 in the following form,
which we will use later to prove Equation 4.10:

$$b_r(i) = \phi_r(i) \prod_{n \in N(i)} m_r(n, i) = \phi_r(i)\, m_r(j, i) \prod_{n \in N(i) \setminus j} m_r(n, i) \;\Rightarrow\; \prod_{n \in N(i) \setminus j} m_r(n, i) = \frac{b_r(i)}{\phi_r(i)\, m_r(j, i)} \qquad (4.12)$$

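To obtain Equation 4.10, we start by writing out the definition of $m_r(i, j)$ and then substitute
Equation 4.2 in the numerator and denominator by considering all possible states
$x_i \in \{+, -\}$:

$$\begin{aligned}
m_r(i, j) = \frac{m_+(i, j)}{m_-(i, j)} &\overset{\text{Equation 4.2}}{=} \frac{\sum_{x \in \{+, -\}} \phi_x(i) \cdot \psi_{ij}(x, +) \cdot \prod_{n \in N(i) \setminus j} m_x(n, i)}{\sum_{x \in \{+, -\}} \phi_x(i) \cdot \psi_{ij}(x, -) \cdot \prod_{n \in N(i) \setminus j} m_x(n, i)} \\
&= \frac{\phi_+(i) \cdot \psi_{ij}(+, +) \cdot \prod_{n \in N(i) \setminus j} m_+(n, i) + \phi_-(i) \cdot \psi_{ij}(-, +) \cdot \prod_{n \in N(i) \setminus j} m_-(n, i)}{\phi_+(i) \cdot \psi_{ij}(+, -) \cdot \prod_{n \in N(i) \setminus j} m_+(n, i) + \phi_-(i) \cdot \psi_{ij}(-, -) \cdot \prod_{n \in N(i) \setminus j} m_-(n, i)}
\end{aligned}$$

From the definition of h in Table 4.2 and our second main idea, we get that $h_+ = \psi(+, +) =
\psi(-, -)$ (which is independent of the nodes, as we explained at the beginning of this section),
while $h_- = \psi(+, -) = \psi(-, +)$. By substituting these formulas in the last equation and by dividing
with $d_{aux}(i) = \phi_-(i)\, h_- \prod_{n \in N(i) \setminus j} m_-(n, i)$, we get

$$\begin{aligned}
m_r(i, j) &= \frac{\phi_+(i) \cdot h_+ \cdot \prod_{n \in N(i) \setminus j} m_+(n, i) + \phi_-(i) \cdot h_- \cdot \prod_{n \in N(i) \setminus j} m_-(n, i)}{\phi_+(i) \cdot h_- \cdot \prod_{n \in N(i) \setminus j} m_+(n, i) + \phi_-(i) \cdot h_+ \cdot \prod_{n \in N(i) \setminus j} m_-(n, i)} \\
&= \frac{h_r\, \phi_r(i) \prod_{n \in N(i) \setminus j} m_r(n, i) + 1}{\phi_r(i) \prod_{n \in N(i) \setminus j} m_r(n, i) + h_r}
= \frac{h_r + \dfrac{1}{\phi_r(i) \prod_{n \in N(i) \setminus j} m_r(n, i)}}{1 + \dfrac{h_r}{\phi_r(i) \prod_{n \in N(i) \setminus j} m_r(n, i)}} \\
&\overset{\text{Equation 4.12}}{=} \frac{h_r + \dfrac{m_r(j, i)}{b_r(i)}}{1 + \dfrac{h_r\, m_r(j, i)}{b_r(i)}}
= \frac{h_r \dfrac{b_r(i)}{m_r(j, i)} + 1}{h_r + \dfrac{b_r(i)}{m_r(j, i)}} \\
&= \frac{h_r\, b_{r,\mathrm{adjusted}}(i, j) + 1}{h_r + b_{r,\mathrm{adjusted}}(i, j)}
= B[h_r, b_{r,\mathrm{adjusted}}(i, j)],
\end{aligned}$$

where $h_r = h_+/h_-$.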

We  note  that  the  division  by  mr(j,i)  (which  comes  from  Equation  4.12)  subtracts  the  influence 
of  node  j  when  preparing  the  message  m(i,  j).  The  same  effect  is  captured  by  the  original  message 
equation  (Equation  4.2).  ■ 
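The identity of Lemma 4 can be checked numerically for a single node: the ratio of the two message entries computed from Equation 4.2 should equal the blending function applied to $h_r$ and the adjusted belief odds. In the sketch below, the priors, the incoming messages, and the value h = 0.7 are arbitrary assumptions; index 0 stands for "+" and index 1 for "-".

```python
import numpy as np

def blend(x, y):
    """Blending function B(x, y) = (x * y + 1) / (x + y)."""
    return (x * y + 1) / (x + y)

rng = np.random.default_rng(0)
h = 0.7                                   # psi(+,+) = psi(-,-) = h
psi = np.array([[h, 1 - h], [1 - h, h]])  # symmetric propagation matrix
phi_i = np.array([0.6, 0.4])              # prior of node i over (+, -)

# Incoming messages to node i; the last row plays the role of the message from j.
msgs_in = rng.uniform(0.2, 0.8, size=(4, 2))
others = msgs_in[:-1]

# Equation 4.2: message from i to j uses every incoming message except j's.
m_ij = psi.T @ (phi_i * others.prod(axis=0))
m_r_direct = m_ij[0] / m_ij[1]

# Odds form (Equations 4.10 and 4.11).
h_r = h / (1 - h)
phi_r = phi_i[0] / phi_i[1]
m_r_in = msgs_in[:, 0] / msgs_in[:, 1]
b_r = phi_r * m_r_in.prod()               # belief odds of node i
b_r_adjusted = b_r / m_r_in[-1]           # divide out the influence of node j
m_r_blend = blend(h_r, b_r_adjusted)
```

Both routes give the same message odds, which illustrates how dividing by $m_r(j, i)$ removes the contribution of node j before the message is sent back to it.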


Before  deriving  the  “about-half”  beliefs  bh  and  “about-half”  messages  mh,  we  introduce  some 
approximations  for  all  the  quantities  of  interest  (messages  and  beliefs)  that  will  be  useful  in  the 
remaining  proofs. 
Lemma 5 (Approximations).

Let $\{v_r, a_r, b_r\}$ and $\{v_h, a_h, b_h\}$ be the odds and “about-half” values of the variables
v, a, and b, respectively. The following approximations are fundamental for the rest of
our lemmas, and hold for all the variables of interest ($m_r$, $b_r$, $\phi_r$, and $h_r$).

$$v_r = \frac{v}{1 - v} = \frac{1/2 + v_h}{1/2 - v_h} \approx 1 + 4v_h \qquad (4.13)$$

$$B(a_r, b_r) \approx 1 + 8 a_h b_h \qquad (4.14)$$

where $B(a_r, b_r) = \frac{a_r \cdot b_r + 1}{a_r + b_r}$ is the blending function for any variables $a_r$, $b_r$.


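Sketch of proof. For Equation 4.13, we start from the odds and “about-half” approximations ($v_r = \frac{v}{1-v}$ and $v_h = v - \frac{1}{2}$, resp.), and then use the Maclaurin series expansion for division (Table 4.4) and keep only the first order terms:

$$
\begin{aligned}
v_r &= \frac{v}{1 - v} = \frac{1/2 + v_h}{1/2 - v_h} = \frac{1 + 2 v_h}{1 - 2 v_h} \\
&\overset{\text{Table 4.4}}{\approx} (1 + 2 v_h)(1 + 2 v_h) = 1 + 4 v_h^2 + 4 v_h \overset{\text{1st order}}{\approx} 1 + 4 v_h \;\Rightarrow \\
v_r &\approx 1 + 4 v_h .
\end{aligned}
$$

For Equation 4.14, we start from the definition of the blending function, then apply Equation 4.13 for all the variables, and use the Maclaurin series expansion for division (Table 4.4). As before, we keep only the first order terms in our approximation:

$$
\begin{aligned}
B(a_r, b_r) &= \frac{a_r b_r + 1}{a_r + b_r} \overset{\text{Equation 4.13}}{\approx} \frac{(1 + 4 a_h)(1 + 4 b_h) + 1}{(1 + 4 a_h) + (1 + 4 b_h)} \\
&= 1 + \frac{16 a_h b_h}{2 + 4 a_h + 4 b_h} = 1 + \frac{8 a_h b_h}{1 + 2 (a_h + b_h)} \\
&\overset{\text{Table 4.4}}{\approx} 1 + 8 a_h b_h \bigl(1 - 2 (a_h + b_h)\bigr) = 1 + 8 a_h b_h - 16 a_h^2 b_h - 16 a_h b_h^2 \\
&\overset{\text{1st order}}{\approx} 1 + 8 a_h b_h . \qquad ■
\end{aligned}
$$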
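The two approximations of Lemma 5 can be sanity-checked numerically. The following is only an illustrative sketch: the function names and the sample deviations `a_h` and `b_h` are assumptions chosen for the example, not values taken from the text.

```python
def odds(v):
    """Odds ratio v_r = v / (1 - v) of a probability-like quantity v."""
    return v / (1.0 - v)


def blend(a_r, b_r):
    """Blending function B(a_r, b_r) = (a_r * b_r + 1) / (a_r + b_r)."""
    return (a_r * b_r + 1.0) / (a_r + b_r)


# Hypothetical small deviations from the half-point (illustrative only).
a_h, b_h = 0.002, 0.001

# Equation 4.13: exact odds vs. the first-order form 1 + 4 * v_h.
exact_odds = odds(0.5 + a_h)
approx_odds = 1.0 + 4.0 * a_h

# Equation 4.14: exact blending of the two odds vs. 1 + 8 * a_h * b_h.
exact_blend = blend(odds(0.5 + a_h), odds(0.5 + b_h))
approx_blend = 1.0 + 8.0 * a_h * b_h

# Both gaps shrink quickly as the deviations approach zero.
print(abs(exact_odds - approx_odds))
print(abs(exact_blend - approx_blend))
```

Both differences are small compared with the deviations themselves, which is what "keeping only the first order terms" relies on.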


The following three lemmas are useful in order to derive the linear equation of FaBP. We note that in all the lemmas we apply several approximations in order to linearize the equations; we omit the “≈” symbol so that the proofs are more readable.
Lemma 6.

The about-half version of the belief equation becomes, for small deviations from the half-point:

$$ b_h(i) \approx \phi_h(i) + \sum_{j \in N(i)} m_h(j,i) \tag{4.15} $$

Proof. We use Equation 4.11 and Equation 4.13, and apply the appropriate Maclaurin series expansions:

$$
\begin{aligned}
& b_r(i) = \phi_r(i) \prod_{j \in N(i)} m_r(j,i) \;\Rightarrow \\
& \log\bigl(1 + 4 b_h(i)\bigr) = \log\bigl(1 + 4 \phi_h(i)\bigr) + \sum_{j \in N(i)} \log\bigl(1 + 4 m_h(j,i)\bigr) \;\Rightarrow \\
& b_h(i) = \phi_h(i) + \sum_{j \in N(i)} m_h(j,i) . \qquad ■
\end{aligned}
$$

Lemma 7.

The about-half version of the message equation becomes:

$$ m_h(i,j) \approx 2 h_h \bigl[ b_h(i) - m_h(j,i) \bigr] \tag{4.16} $$

Proof. We combine Equation 4.10, Equation 4.13 and Equation 4.14:

$$ m_r(i,j) = B[h_r, b_{r,adjusted}(i,j)] \;\Rightarrow\; m_h(i,j) = 2 h_h\, b_{h,adjusted}(i,j) . \tag{4.17} $$

In order to derive $b_{h,adjusted}(i,j)$ we use Equation 4.13 and the approximation of the Maclaurin expansion $\frac{1}{1+\epsilon} = 1 - \epsilon$ for a small quantity $\epsilon$:

$$
\begin{aligned}
& b_{r,adjusted}(i,j) = b_r(i) / m_r(j,i) \;\Rightarrow \\
& 1 + 4 b_{h,adjusted}(i,j) = \bigl(1 + 4 b_h(i)\bigr)\bigl(1 - 4 m_h(j,i)\bigr) \;\Rightarrow \\
& b_{h,adjusted}(i,j) = b_h(i) - m_h(j,i) - 4 b_h(i)\, m_h(j,i) .
\end{aligned}
\tag{4.18}
$$

Substituting Equation 4.18 into Equation 4.17 and ignoring the terms of second order leads to the about-half version of the message equation. ■
Lemma 8.

At steady state, the messages can be expressed in terms of the beliefs:

$$ m_h(i,j) \approx \frac{2 h_h}{1 - 4 h_h^2} \bigl[ b_h(i) - 2 h_h\, b_h(j) \bigr] \tag{4.19} $$

Proof. We apply Lemma 7 both for $m_h(i,j)$ and $m_h(j,i)$ and we solve for $m_h(i,j)$. ■
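For completeness, the algebra behind this proof can be spelled out as follows. This is a worked sketch that uses only Equation 4.16, written once for each direction of the edge between $i$ and $j$:

$$
\begin{aligned}
& m_h(i,j) = 2 h_h \bigl[ b_h(i) - m_h(j,i) \bigr], \qquad m_h(j,i) = 2 h_h \bigl[ b_h(j) - m_h(i,j) \bigr] \;\Rightarrow \\
& m_h(i,j) = 2 h_h\, b_h(i) - 4 h_h^2\, b_h(j) + 4 h_h^2\, m_h(i,j) \;\Rightarrow \\
& (1 - 4 h_h^2)\, m_h(i,j) = 2 h_h \bigl[ b_h(i) - 2 h_h\, b_h(j) \bigr] ,
\end{aligned}
$$

and dividing both sides by $1 - 4 h_h^2$ gives Equation 4.19.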

Based  on  the  above  derivations,  we  can  now  obtain  the  equation  for  FaBP  (Theorem  1),  which 
we  presented  in  Section  4.2. 


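Proof (of Theorem 1: FaBP). We substitute Equation 4.19 (Lemma 8), written for the messages $m_h(j,i)$, into Equation 4.15 and we obtain:

$$
\begin{aligned}
& b_h(i) - \sum_{j \in N(i)} m_h(j,i) = \phi_h(i) \;\Rightarrow \\
& b_h(i) + \sum_{j \in N(i)} \frac{4 h_h^2\, b_h(i)}{1 - 4 h_h^2} - \sum_{j \in N(i)} \frac{2 h_h\, b_h(j)}{1 - 4 h_h^2} = \phi_h(i) \;\Rightarrow \\
& b_h(i) + a \sum_{j \in N(i)} b_h(i) - c' \sum_{j \in N(i)} b_h(j) = \phi_h(i) \;\Rightarrow \\
& (I + aD - c'A)\, b_h = \phi_h ,
\end{aligned}
$$

where, matching the coefficients of the second line, $a = \frac{4 h_h^2}{1 - 4 h_h^2}$ and $c' = \frac{2 h_h}{1 - 4 h_h^2}$. ■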


4.4  Analysis  of  Convergence 

Here  we  study  the  sufficient,  but  not  necessary  conditions  for  which  our  method,  FaBP,  converges. 
The  implementation  details  of  FaBP  are  described  in  the  upcoming  Section  4.5.  Lemma  9, 
Lemma  10,  and  Equation  4.23  give  the  convergence  conditions. 

All our results are based on the power expansion that results from the inversion of a matrix of the form $I - W$; all the methods undergo this process, as we show in Table 4.3. Specifically, we need the inverse of the matrix $I + aD - c'A = I - W$, which is given by the expansion:

$$ (I - W)^{-1} = I + W + W^2 + W^3 + \dots \tag{4.20} $$

and the solution of the linear system is given by the formula

$$ (I - W)^{-1} \phi_h = \phi_h + W \cdot \phi_h + W \cdot (W \cdot \phi_h) + \dots \tag{4.21} $$

This method is fast, since the computation can be done in iterations, each of which consists of a sparse-matrix/vector multiplication. This is referred to as the Power Method. However, the Power Method does not always converge. In this section we examine its convergence conditions.
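The truncated series of Equation 4.21 can be sketched in a few lines. The snippet below is a minimal illustration, not the implementation used in the experiments: it uses a dense NumPy matrix instead of a sparse one, and the toy graph, the value of `h_h`, and the function name `power_method_solve` are assumptions made for the example.

```python
import numpy as np


def power_method_solve(W, phi_h, num_iters=50):
    """Approximate (I - W)^{-1} phi_h with the truncated series of Eq. 4.21.

    Each iteration costs one matrix-vector product; with a sparse W this is
    linear in the number of non-zero entries.
    """
    term = phi_h.copy()     # current term W^k phi_h, starting with k = 0
    total = phi_h.copy()    # running sum of the series
    for _ in range(num_iters):
        term = W @ term     # next term: W^(k+1) phi_h
        total += term
    return total


# Hypothetical toy input: a path graph 0 - 1 - 2 with one labeled end node.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))
h_h = 0.1                                  # illustrative homophily factor
a = 4 * h_h**2 / (1 - 4 * h_h**2)          # coefficients of the linear system
c_prime = 2 * h_h / (1 - 4 * h_h**2)
W = c_prime * A - a * D                    # so that I + aD - c'A = I - W

phi_h = np.array([0.01, 0.0, 0.0])         # prior residuals around 1/2
b_series = power_method_solve(W, phi_h)
b_direct = np.linalg.solve(np.eye(3) - W, phi_h)   # reference solution
```

For this small, well-conditioned example the series and the direct solve agree; on large graphs only the iterative form is practical.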
Lemma 9 (Largest eigenvalue).

The series $\sum_{k=0}^{\infty} |W|^k = \sum_{k=0}^{\infty} |c'A - aD|^k$ converges iff $\lambda(W) < 1$, where $\lambda(W)$ is the magnitude of the largest eigenvalue of $W$.

Given that the computation of the largest eigenvalue is non-trivial, we suggest using Lemma 10 or Lemma 11, which give a closed form for computing the “about-half” homophily factor, $h_h$.


Lemma 10 (1-norm).

The series $\sum_{k=0}^{\infty} |W|^k = \sum_{k=0}^{\infty} |c'A - aD|^k$ converges if

$$ h_h < \frac{1}{2 \bigl(1 + \max_j (d_{jj})\bigr)} \tag{4.22} $$

where $d_{jj}$ are the elements of the diagonal matrix $D$.


Proof. In a nutshell, the proof is based on the fact that the power series converges if the 1-norm, or equivalently the ∞-norm, of the symmetric matrix $W$ is smaller than 1.

Specifically, in order for the power series to converge, a sub-multiplicative norm of matrix $W = c'A - aD$ should be smaller than 1. In this analysis we use the 1-norm (or equivalently the ∞-norm). The non-zero elements of matrix $W$ are either $c' = \frac{2 h_h}{1 - 4 h_h^2}$ or $-a\, d_{ii} = \frac{-4 h_h^2\, d_{ii}}{1 - 4 h_h^2}$. Thus, we require

$$
\max_j \Bigl( \sum_{i=1}^{n} |W_{ij}| \Bigr) < 1 \;\Rightarrow\; (c' + a) \cdot \max_j d_{jj} < 1 \;\Rightarrow\; \frac{2 h_h}{1 - 2 h_h} \max_j d_{jj} < 1 \;\Rightarrow\; h_h < \frac{1}{2 (1 + \max_j d_{jj})} . \qquad ■
$$


Lemma 11 (Frobenius norm).

The series $\sum_{k=0}^{\infty} |W|^k = \sum_{k=0}^{\infty} |c'A - aD|^k$ converges if

$$ h_h < \sqrt{\frac{-c_1 + \sqrt{c_1^2 + 4 c_2}}{8 c_2}} \tag{4.23} $$

where $c_1 = 2 + \sum_i d_{ii}$ and $c_2 = \sum_i d_{ii}^2 - 1$.

Proof. This upper bound for $h_h$ is obtained by considering the Frobenius norm of matrix $W$ and solving the inequality $\|W\|_F = \sqrt{\sum_{i=1}^{n} \sum_{j=1}^{n} |W_{ij}|^2} < 1$ with respect to $h_h$. ■


Equation 4.23 is preferable over Equation 4.22 when the degrees of the graph’s nodes demonstrate considerable standard deviation. The 1-norm yields small $h_h$ for very big values of the highest degree, while the Frobenius norm gives a higher upper bound for $h_h$. Nevertheless, we should bear in mind that $h_h$ should be a sufficiently small number in order for the “about-half” approximations to hold.
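To make the difference between the two bounds concrete, the sketch below evaluates Equation 4.22 and Equation 4.23 on a hypothetical star graph with a highly skewed degree distribution. The function names and the example graph are assumptions for illustration; the Frobenius formula additionally assumes $c_2 > 0$, i.e., that the graph has more than a single edge.

```python
import numpy as np


def h_bound_one_norm(degrees):
    """Equation 4.22: h_h < 1 / (2 * (1 + max degree))."""
    return 1.0 / (2.0 * (1.0 + np.max(degrees)))


def h_bound_frobenius(degrees):
    """Equation 4.23 with c1 = 2 + sum(d_ii) and c2 = sum(d_ii^2) - 1."""
    degrees = np.asarray(degrees, dtype=float)
    c1 = 2.0 + degrees.sum()
    c2 = np.sum(degrees**2) - 1.0          # assumed to be positive
    return np.sqrt((-c1 + np.sqrt(c1**2 + 4.0 * c2)) / (8.0 * c2))


# Hypothetical star graph: one hub of degree 100 and 100 leaves of degree 1.
degrees = np.array([100] + [1] * 100)
bound_1 = h_bound_one_norm(degrees)        # dominated by the hub's degree
bound_f = h_bound_frobenius(degrees)       # less sensitive to a single hub
```

On this example the Frobenius bound is several times larger than the 1-norm bound, in line with the remark above about graphs whose degrees vary widely.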


4.5  Proposed  Algorithm:  FaBP 

Based on the analysis in Section 4.2 and Section 4.4, we propose the FaBP algorithm (Algorithm 4.2).


Algorithm 4.2 FaBP

Input: graph G, prior beliefs $\phi$

Step 1: Pick $h_h$ to achieve convergence: $h_h$ = max {Equation 4.22, Equation 4.23} and compute the parameters $a$ and $c'$ as described in Theorem 1.

Step 2: Solve the linear system of Equation 4.4. Notice that all the quantities involved in this equation are close to zero.

RETURN: beliefs b for all the nodes
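The two steps can be sketched end to end on a toy graph. This is a simplified illustration under stated assumptions rather than the thesis implementation: it uses a dense direct solve instead of the Power Method, it picks `h_h` as half of the 1-norm bound (Equation 4.22) rather than the maximum of the two bounds, and the function name `fabp` and the example graph are hypothetical.

```python
import numpy as np


def fabp(A, phi_h, h_h=None):
    """Sketch of Algorithm 4.2 for a small dense adjacency matrix A.

    phi_h holds the prior residuals (prior belief minus 1/2); unlabeled
    nodes get 0. Returns the "about-half" final beliefs b_h.
    """
    degrees = A.sum(axis=1)
    if h_h is None:
        # Step 1 (simplified assumption): stay inside the 1-norm bound.
        h_h = 0.5 / (2.0 * (1.0 + degrees.max()))
    a = 4 * h_h**2 / (1 - 4 * h_h**2)
    c_prime = 2 * h_h / (1 - 4 * h_h**2)
    # Step 2: solve (I + aD - c'A) b_h = phi_h.
    system = np.eye(len(A)) + a * np.diag(degrees) - c_prime * A
    return np.linalg.solve(system, phi_h)


# Hypothetical example: two triangles joined by the edge 2 - 3.
A = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1.0

phi_h = np.zeros(6)
phi_h[0] = +0.001    # node 0 labeled as class "+"
phi_h[5] = -0.001    # node 5 labeled as class "-"

b_h = fabp(A, phi_h)
labels = np.where(b_h > 0, "+", "-")   # the sign of b_h gives the class
```

Each triangle inherits the label of its seed node, which is the homophily behavior the linear system is meant to capture.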


We  conjecture  that  if  the  achieved  accuracy  is  not  sufficient,  the  results  of  FaBP  can  still  be 
a  good  starting  point  for  the  original,  iterative  BP  algorithm.  One  would  need  to  use  the  final 
beliefs  of  FaBP  as  the  prior  beliefs  of  BP,  and  run  a  few  iterations  of  BP  until  convergence.  In 
the  datasets  we  studied,  this  additional  step  was  not  required,  as  FaBP  achieves  equal  or  higher 
accuracy  than  BP,  while  being  faster. 


4.6  Experiments 

We  present  experimental  results  to  answer  the  following  questions: 

Q1: How accurate is FaBP?

Q2: Under what conditions does FaBP converge?

Q3: How sensitive is FaBP to the values of $h$ and $\phi$?

Q4:  How  does  FaBP  scale  on  very  large  graphs  with  billions  of  nodes  and  edges? 
Table 4.5: FaBP: Order and size of graphs.

Dataset        Nodes            Edges
YahooWeb       1,413,511,390    6,636,600,779
Kronecker 1    177,147          1,977,149,596
Kronecker 2    120,552          1,145,744,786
Kronecker 3    59,049           282,416,200
Kronecker 4    19,683           40,333,924
DBLP           37,791           170,794


The graphs we used in our experiments are summarized in Table 4.5. To answer Q1 (accuracy), Q2 (convergence), and Q3 (sensitivity), we use the DBLP dataset¹ [GLF+09], which consists of 14,376 papers, 14,475 authors, 20 conferences, and 8,920 terms. A small portion of these nodes are manually labeled based on their area (Artificial Intelligence, Databases, Data Mining, and Information Retrieval): 4,057 authors, 100 papers, and all the conferences. We adapted the labels of the nodes to two classes: AI (Artificial Intelligence) and not AI (= Databases, Data Mining, and Information Retrieval). In each trial, we run FaBP on the DBLP network where (1 − p)% = (1 − a)% of the labels of the papers and the authors have been discarded. Then, we test the classification accuracy on the nodes whose labels were discarded. The values of a and p are 0.1%, 0.2%, 0.3%, 0.4%, 0.5%, and 5%. To avoid combinatorial explosion, we consider {$h_h$, priors} = {±0.002, ±0.001} as the anchor values, and then we vary one parameter at a time. When the results are the same for different values of a% = p%, due to lack of space, we randomly pick the plots to present.

To answer Q4 (scalability), we use the real YahooWeb graph and synthetic Kronecker graph datasets. YahooWeb is a Web graph containing 1.4 billion web pages and 6.6 billion edges; we label 11 million educational and 11 million adult web pages. We use 90% of these labeled data to set node priors, and use the remaining 10% to evaluate the accuracy. For parameters, we set $h_h$ to 0.001 using Lemma 11 (Frobenius norm), and the magnitude of the prior beliefs to 0.5 ± 0.001. The Kronecker graphs are synthetic graphs generated by the Kronecker generator [ ].


4.6.1 Q1: Accuracy

Figure  4.3  shows  the  scatter  plots  of  beliefs  (FaBP  vs.  BP)  for  each  node  of  the  DBLP  data.  We 
observe  that  FaBP  and  BP  result  in  practically  the  same  beliefs  for  all  the  nodes  in  the  graph, 
when  run  with  the  same  parameters,  and  thus,  they  yield  the  same  accuracy.  Conclusions  are 
identical  for  any  labeled-set-size  we  tried  (0.1%  and  0.3%  shown  in  Figure  4.3). 

Observation  6.  FaBP  and  BP  agree  on  the  classification  of  the  nodes  when  run  with  the  same 
parameters. 

¹ http://web.engr.illinois.edu/~mingjil/DBLP_four_area.zip




Figure 4.3: The quality of scores of FaBP is near-identical to BP, i.e., all points are on the 45-degree line in the scatter plot of beliefs (FaBP vs. BP) for each node of the DBLP sub-network; red/green points correspond to nodes classified as “AI”/“not-AI”, respectively. (Panel title: scatter plot of beliefs for (h, priors) = (0.5 ± 0.002, 0.5 ± 0.001).)

4.6.2  Q2:  Convergence 

We examine how the value of the “about-half” homophily factor affects the convergence of FaBP. In Figure 4.4, the red line annotated with “max |eval| = 1” splits the plots into two regions: (a) on the left, the Power Method converges and FaBP is accurate; (b) on the right, the Power Method diverges, resulting in a significant drop in the classification accuracy. We annotate the number of classified nodes for the values of $h_h$ that leave some nodes unclassified because of numerical representation issues. The low accuracy scores for the smallest values of $h_h$ are due to the unclassified nodes, which are counted as misclassifications. The Frobenius norm-based method yields a greater upper bound for $h_h$ than the 1-norm-based method, preventing any numerical representation problems.

Observation  7.  Our  convergence  bounds  consistently  coincide  with  high-accuracy  regions.  Thus, 
we  recommend  choosing  the  homophily  factor  based  on  the  Frobenius  norm  using  Equation  4.23. 


4.6.3  Q3:  Sensitivity  to  parameters 

Figure 4.4 shows that FaBP is insensitive to the “about-half” homophily factor, $h_h$, as long as the latter is within the convergence bounds. In Figure 4.5 we observe that the accuracy score is insensitive to the magnitude of the prior beliefs. We only show the cases a, p ∈ {0.1%, 0.3%, 0.5%}, as for all values except for a, p = 5.0%, the accuracy is practically identical. The results are similar for different “about-half” homophily factors.

Observation  8.  The  accuracy  results  are  insensitive  to  the  magnitude  of  the  prior  beliefs  and 
homophily  factor  -  as  long  as  the  latter  is  within  the  convergence  bounds  we  gave  in  Section  4.4. 
Figure 4.4: FaBP achieves maximum accuracy within the convergence bounds. If not all nodes are classified by FaBP, we give the number of classified nodes in red. (Panels: accuracy with respect to $h_h$ for prior beliefs = ±0.001, with a% = p% = 0.1%, 0.3%, and 0.5% labels.)

Figure 4.5: Insensitivity of FaBP to the magnitude of the prior beliefs. (Plot: accuracy with respect to the magnitude of the prior beliefs, for $h_h$ = ±0.002.)

Figure 4.6: Running time of FaBP vs. # edges for 10 and 30 machines on Hadoop. Kronecker graphs are used.
4.6.4 Q4: Scalability

To show the scalability of FaBP, we implemented FaBP on Hadoop, an open source MapReduce framework which has been successfully used for large-scale graph analysis [ ]. We first show the scalability of FaBP on the number of edges of Kronecker graphs. As seen in Figure 4.6, FaBP scales linearly with the number of edges. Next, we compare the Hadoop implementations of FaBP and BP [ ] in terms of running time and accuracy on the YahooWeb graph. Figure 4.7(a)-(c) shows that FaBP achieves the maximum accuracy after two iterations of the Power Method, and is ~2× faster than BP.

Observation 9. Our FaBP implementation is linear in the number of edges, with ~2× faster running time than BP on Hadoop.


Figure 4.7: Performance on the YahooWeb graph (best viewed in color): FaBP wins on speed and wins/ties on accuracy. In (c), accuracy vs. runtime, each method has 4 points, which correspond to the number of steps from 1 to 4. Notice that FaBP achieves the maximum accuracy after 84 minutes, while BP achieves the same accuracy after 151 minutes.
4.7 Summary

Which of the many guilt-by-association methods should one use? In this chapter we have answered this question, and presented FaBP, a new, fast algorithm to do such computations. The contributions of our work are the following:

•  Theory  and  Correspondences:  We  have  shown  that  successful,  major  guilt-by-association 
approaches  (RWR,  SSL,  and  BP  variants)  are  closely  related,  and  we  have  proved  that  some 
are  even  equivalent  under  certain  conditions  (Theorem  1,  Lemma  1,  Lemma  2,  Lemma  3). 

•  Effective  and  Scalable  Algorithm:  Thanks  to  our  analysis,  we  have  designed  FaBP,  a  fast 
and  accurate  approximation  to  the  standard  belief  propagation  (BP),  which  has  convergence 
guarantees  (Lemma  10  and  Lemma  11). 

•  Experiments  on  Real  Graphs:  We  have  shown  that  FaBP  is  significantly  faster,  about  2x, 
and  it  has  the  same  or  better  accuracy  (Area  Under  Curve,  or  AUC)  than  BP.  Moreover,  we 
have  shown  how  to  parallelize  it  with  MapReduce  (Hadoop),  operating  on  billion-node 
graphs. 

Thanks to our analysis, our guide to practitioners is the following: out of the three guilt-by-association methods, we recommend belief propagation, for two reasons: (1) it has solid, Bayesian underpinnings and (2) it can naturally handle heterophily, as well as multiple class-labels. With respect to parameter settings, we recommend choosing the homophily score, $h_h$, according to the Frobenius bound (Equation 4.23).

Can  FaBP  be  extended  to  handle  multi-class  labels?  Yes,  but  to  handle  multiple  labels,  the 
derivations  of  FaBP  must  be  redone,  because  the  ratio  approach  that  we  used  does  not  work  for 
more  than  two  classes.  We  elaborate  on  this  topic  in  the  following  chapter. 
Chapter 5

Inference  in  a  Graph:  More  Classes 


How  can  we  tell  whether  accounts  are  real  or  fake  in  a  social  network?  And  how  can  we  tell 
which  accounts  belong  to  liberal,  conservative  or  centrist  users?  As  we  said  in  Chapter  4,  we 
can  often  answer  such  questions  and  label  nodes  in  a  network  based  on  the  labels  of  their 
neighbors  and  appropriate  assumptions  of  homophily  (“birds  of  a  feather  flock  together”)  or 
heterophily  (“opposites  attract”) .  In  Chapter  4  we  have  shown  the  equivalence  of  three  guilt- 
by-association  approaches  (Random  Walk  with  Restarts  RWR,  Semi-Supervised  Learning  SSL, 
and  Belief  Propagation  BP)  that  can  be  used  for  transductive  learning,  and  introduced  a  fast 
approximation  of  BP  called  FaBP,  which  can  be  used  for  binary  classification  in  networked  data. 

In  this  chapter,  we  focus  on  BP,  which  iteratively  propagates  the  information  from  a  few  nodes 
with  explicit  initial  labels  throughout  the  network  until  convergence.  Here,  we  not  only  cover 
the  popular  cases  with  k=2  classes  (e.g.,  talkative  vs.  silent  people  in  an  online  dating  site), 
but  also  capture  more  general  settings  that  mix  homophily  and  heterophily.  We  illustrate  with 
an  example  taken  from  online  auction  settings  like  eBay  [  ’CWF07]:  We  observe  k=3  classes  of 
people:  fraudsters  (F),  accomplices  (A)  and  honest  people  (H).  Honest  people  buy  and  sell  from 
other  honest  people,  as  well  as  accomplices.  Accomplices  establish  a  good  reputation  (thanks  to 
multiple  interactions  with  honest  people),  they  never  interact  with  other  accomplices  (waste  of 
effort and money), but they do interact with fraudsters, forming near-bipartite cores between the two classes. Fraudsters primarily interact with accomplices (to build reputation); the interaction with honest people (to defraud them) happens in the last few days before the fraudster’s account is shut down.

Figure 5.1: Three types of network effects with example coupling matrices. Color intensity corresponds to the coupling strengths between classes of neighboring nodes. (a): D: Democrats, R: Republicans, (b): T: Talkative, S: Silent, (c): H: Honest, A: Accomplice, F: Fraudster.

Thus,  in  general,  we  can  have  k  different  classes,  and  k  x  k  affinities  or  coupling  strengths 
between  them.  These  affinities  can  be  organized  in  a  coupling  matrix,  which  we  call  propagation 
matrix.  In  this  work,  we  assume  the  heterophily  matrix  to  be  given;  e.g.,  by  domain  experts. 
Learning  heterophily  from  existing  partially  labeled  data  is  interesting  future  work  (see  [  jatl4]  for 
initial  results).  In  Figure  5.1  we  give  the  coupling  matrices  for  three  examples  of  network  effects. 
Figure  5.1(a)  shows  the  matrix  for  homophily;  it  captures  that  a  connection  between  people  with 
similar  political  orientations  is  more  likely  than  between  people  with  different  orientations.  An 
example  of  homophily  with  k=4  classes  would  be  in  a  co-authorship  network:  Researchers  in 
computer  science,  physics,  chemistry  and  biology  tend  to  publish  with  co-authors  of  similar  training. 
Efficient  labeling  in  case  of  homophily  is  possible;  e.g.,  by  simple  relational  learners  [  VIP07] . 
Figure  5.1(b)  captures  our  example  for  heterophily;  class  T  is  more  likely  to  date  members  of 
class  S,  and  vice  versa.  Finally,  Figure  5.1(c)  shows  a  more  general  example:  We  see  homophily 
between  members  of  class  H  and  heterophily  between  members  of  classes  A  and  F. 

In all of the above scenarios, we are interested in the most likely “beliefs” (or labels) for all nodes in the graph by using BP. The underlying problem is then: How can we assign class labels when we know who-contacts-whom and the a priori (“initial” or “explicit”) labels for some of the nodes in the network? How can we handle multiple class labels, as well as intricate network effects? One of the main challenges is that BP has well-known convergence problems in graphs with loops (see [SNB+08] for a detailed discussion from a practitioner’s point of view). While there is a lot of work on the convergence of BP (e.g., [ MK06, MK07]), exact criteria for convergence are not known [Mur12, Sec. 22]. This issue raises one more fundamental theoretical question of practical importance: How can we find the sufficient and necessary conditions for the convergence of the algorithm?

In this chapter we introduce LinBP, a new formulation of BP for inference with multiple class labels, which, unlike standard BP, (i) comes with exact convergence guarantees, (ii) allows closed-form solutions, and (iii) gives a clear intuition about the algorithms. In more detail, our contributions are:

• Problem Formulation: We give a new matrix formulation for multi-class BP called Linearized Belief Propagation (LinBP) (Section 5.2).

•  Theory:  We  prove  that  LinBP  results  from  applying  a  certain  linearization  process  to  the 
update  equations  of  BP  (Section  5.3),  and  show  that  the  solution  to  LinBP  can  be  obtained 
in  a  closed  form  by  the  inversion  of  an  appropriate  Kronecker  product  (Section  5.3.2).  We 
also  show  that  this  new  closed-form  provides  us  with  exact  convergence  guarantees  (even  on 
graphs  with  loops)  and  a  clear  intuition  about  the  reasons  for  convergence/non-convergence 
(Section  5.4.1). 

•  Experiments  on  Real  Graphs:  We  show  that  a  main-memory  implementation  of  LinBP 
is  faster  than  standard  BP,  while  giving  almost  identical  classifications  (>  99.9%  overlap, 
Section  5.6). 
We gave an introduction to BP in Chapter 4. In Section 5.1 we remind the reader of the main equations for BP and give some additional background for LinBP. In Section 5.2 we introduce the LinBP matrix formulation, and in Section 5.3 we sketch its derivation. Section 5.4 provides convergence guarantees, and extends LinBP to weighted graphs. Section 5.5 presents the equivalence between LinBP and FaBP presented in Chapter 4. We give experiments, related work, and a summary of our work in Sections 5.6, 5.7, and 5.8, respectively.


5.1  Belief  Propagation  for  Multiple  Classes 

As we have mentioned in Chapter 4, Belief Propagation (BP) is an efficient inference algorithm on probabilistic graphical models that can be used for transductive inference. For more details, we refer the reader to Section 4.1.3.

In our scenario, we are interested in the most likely beliefs (or classes) for all nodes in a network. BP helps to iteratively propagate the information from a few nodes with explicit beliefs throughout the network. More formally, consider a graph of n nodes and k possible classes. Each node maintains a k-dimensional belief vector where each element i represents a weight proportional to the belief that this node belongs to class i. We denote by $\phi_s$ the vector of prior (or explicit or initial) beliefs and $b_s$ the vector of posterior (or implicit or final) beliefs at node s, and require that $\phi_s$ and $b_s$ are normalized to 1; i.e., $\sum_i \phi_s(i) = \sum_i b_s(i) = 1$.¹ Using $m_{st}$ for the k-dimensional message that node s sends to node t, we can write the BP update formulas [Mur12, Wei00] for the belief vector of each node and the messages it sends w.r.t. class i as:

$$ b_s(i) \leftarrow \frac{1}{Z_s}\, \phi_s(i) \prod_{u \in N(s)} m_{us}(i) \tag{5.1} $$

$$ m_{st}(i) \leftarrow \sum_j H_{st}(j,i)\, \phi_s(j) \prod_{u \in N(s) \setminus t} m_{us}(j) \tag{5.2} $$

where $Z_s$ is a normalization constant that makes the elements of $b_s$ sum up to 1, and $H_{st}(j,i)$ is a proportional “coupling weight” that indicates the relative influence of class j of node s on class i of node t (Figure 5.1). Equations 5.1-5.2 are generalizations of Equations 4.2-4.3 in Chapter 4. We note that these are update formulas, while the equations we derive next are true equations. We chose the symbol H for the coupling weights as a reminder of our motivating concepts of homophily and heterophily. Concretely, if $H(i,i) > H(j,i)$ for all $i \neq j$, we say homophily is present. If the opposite inequality is true for all i, then heterophily is present. Otherwise, there exists homophily for some classes, and heterophily for others. For reference, in Chapter 4 we used h for the homophily factor (constant). Similarly to our previous analysis, we assume that the relative coupling between classes is the same in the whole graph; i.e., $H_{st}(j,i)$ is identical for all edges in the graph. We further require this coupling matrix H to be doubly stochastic and symmetric: (i) double stochasticity is a necessary requirement for our mathematical derivation;² (ii) symmetry is not required, but follows from our assumption of undirected edges.

¹ We note that we write $\sum_i$ as a short form for $\sum_{i \in [k]}$ whenever k is clear from the context.
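The update formulas of Equations 5.1 and 5.2 can be sketched directly for a small graph. The code below is only an illustration under stated assumptions: it uses synchronous message updates, a fixed number of iterations, no message rescaling, and a hypothetical 4-node path graph with a homophily coupling matrix; the name `run_bp` is not from the text.

```python
import numpy as np


def run_bp(A, Phi, H, num_iters=20):
    """Iterate the BP updates (Equations 5.1 and 5.2) on an undirected graph.

    A   : n x n 0/1 adjacency matrix
    Phi : n x k prior beliefs, each row sums to 1
    H   : k x k coupling matrix; H[j, i] couples class j of the sender
          to class i of the recipient
    """
    n, k = Phi.shape
    neighbors = [np.flatnonzero(A[s]).tolist() for s in range(n)]
    # msgs[(s, t)] is the k-dimensional message from s to t, initialized to 1.
    msgs = {(s, t): np.ones(k) for s in range(n) for t in neighbors[s]}
    for _ in range(num_iters):
        new_msgs = {}
        for (s, t) in msgs:
            # Prior of s times all incoming messages except the one from t.
            prod = Phi[s].copy()
            for u in neighbors[s]:
                if u != t:
                    prod = prod * msgs[(u, s)]
            new_msgs[(s, t)] = H.T @ prod   # entry i: sum_j H[j, i] * prod[j]
        msgs = new_msgs
    B = Phi.copy()
    for s in range(n):
        for u in neighbors[s]:
            B[s] = B[s] * msgs[(u, s)]
    return B / B.sum(axis=1, keepdims=True)   # the 1 / Z_s normalization


# Hypothetical example: path graph 0 - 1 - 2 - 3 with k = 3 classes.
A = np.zeros((4, 4))
for u, v in [(0, 1), (1, 2), (2, 3)]:
    A[u, v] = A[v, u] = 1.0
H = np.array([[0.6, 0.2, 0.2],
              [0.2, 0.6, 0.2],
              [0.2, 0.2, 0.6]])          # homophily, doubly stochastic
Phi = np.full((4, 3), 1.0 / 3.0)         # unlabeled nodes: uniform prior
Phi[0] = [0.8, 0.1, 0.1]                 # node 0 leans towards class 0
Phi[3] = [0.1, 0.1, 0.8]                 # node 3 leans towards class 2

B = run_bp(A, Phi, H)
top_class = B.argmax(axis=1)
```

Because the example graph is a tree, the messages stop changing after a few iterations, and each unlabeled node adopts the class of its closer labeled end node.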

The  goal  in  this  chapter  is  to  find  the  top  beliefs  for  each  node  in  the  network,  and  to  assign 
these  beliefs  to  the  respective  nodes.  That  is,  for  each  node  s,  we  are  interested  in  determining  the 
classes  with  the  highest  values  in  bs. 

Problem  Definition  4.  [Top  belief  assignment] 

•  Given:  (1)  an  undirected  graph  with  n  nodes  and  adjacency  matrix  A, 

(2)  a  symmetric,  doubly  stochastic  coupling  k  x  k  matrix  H,  where  H(j,i)  indicates  the 
relative  influence  of  class  j  of  a  node  on  class  i  of  its  neighbor, 

(3) a matrix of explicit beliefs $\Phi$, where $\Phi(s,i) \neq 0$ is the belief in class i by node s

•  Find:  for  each  node  a  set  of  classes  with  the  highest  final  belief  (i.e.,  top  belief  assignment). 

So,  our  problem  is  to  find  a  mapping  from  nodes  to  sets  of  classes.  Table  5.1  gives  the  notation  that 
we  use  in  this  chapter,  which  is  a  superset  of  the  symbols  we  gave  in  Chapter  4. 


5.2  Proposed  Method:  Linearized  Belief  Propagation 

In this section, we introduce Linearized Belief Propagation (LinBP), which is a closed-form description for the final beliefs after the convergence of BP under mild restrictions of our parameters. The main idea is to center the values around default values (using Maclaurin series expansions) and to then restrict our parameters to small deviations from these defaults. The resulting equations replace multiplication with addition and can thus be put into a matrix framework with a closed-form solution. This allows us to later give the exact convergence criteria based on problem parameters.

Definition 1. [Centering] We call a vector or matrix x “centered around c” if all its entries are close to c and their average is exactly c.

Definition 2. [Residual vector/matrix] If a vector x is centered around c, then the residual vector around c is defined as $x_h = [x_1 - c, x_2 - c, \dots]$. Accordingly, we denote a matrix X as a residual matrix if each column and row vector corresponds to a residual vector.

For example, we call the vector x = [1.01, 1.02, 0.97] centered around c = 1.³ The residuals from c will form the residual vector $x_h$ = [0.01, 0.02, −0.03]. Notice that the entries in a residual vector always sum up to 0, by construction.
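As a quick check of these definitions, the following sketch recomputes the example above and repeats it for a probability vector centered around 1/k. The second vector `b` is a hypothetical example, not taken from the text.

```python
import numpy as np

# The example from the text: a vector centered around c = 1.
x = np.array([1.01, 1.02, 0.97])
c = x.mean()                    # the center is the average of the entries
residual = x - c                # approximately [0.01, 0.02, -0.03]

# A hypothetical belief vector over k = 3 classes, centered around 1/k.
b = np.array([0.5, 0.3, 0.2])
b_residual = b - 1.0 / len(b)   # entries again sum to 0 (up to rounding)
```

Since the entries of a probability vector sum to 1, their average is always 1/k, so its residual entries sum to 0 as well.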

The  key  ideas  in  our  proofs  are  as  follows: 

1. The k-dimensional message vectors m are centered around 1.

2. All the other k-dimensional vectors are probability vectors; they have to sum up to 1, and thus they are centered around 1/k. This holds for the belief vectors $b$, $\phi$, and for all the entries of matrix H.

3. We make use of each of the two linearizing approximations shown in Table 5.2 exactly once.

² We note that single-stochasticity could easily be constructed by taking any set of vectors of relative coupling strengths between neighboring classes, normalizing them to 1, and arranging them in a matrix.

³ All vectors x in this chapter are assumed to be column vectors $[x_1, x_2, \dots]^T$ even if written as row vectors $[x_1, x_2, \dots]$.

According  to  key  idea  1,  we  require  that  the  messages  sent  are  normalized  so  that  the  average 
value  of  the  elements  of  a  message  vector  is  1  or,  equivalently,  that  the  elements  sum  up  to  k: 

$$ m_{st}(i) \leftarrow \frac{1}{Z_{st}} \sum_j H(j,i)\, \phi_s(j) \prod_{u \in N(s) \setminus t} m_{us}(j) \tag{5.3} $$

Here,  we  write  Zst  as  a  normalization  constant  that  makes  the  elements  of  mst  sum  up  to  k. 
Scaling  all  elements  of  a  message  vector  by  the  same  constant  does  not  affect  the  resulting  beliefs, 
since  the  normalization  constant  in  Equation  5.1  makes  sure  that  the  beliefs  are  always  normalized 
to  1,  independent  of  the  scaling  of  the  messages.  Thus,  scaling  messages  still  preserves  the  exact 
solution,  yet  it  will  be  essential  for  our  derivation. 
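The claim that rescaling messages leaves the beliefs unchanged can be checked numerically. In this sketch the prior and the two incoming messages are hypothetical values, and `belief` is an illustrative helper that implements only the normalized product of Equation 5.1.

```python
import numpy as np


def belief(phi, incoming):
    """Equation 5.1: normalized product of the prior and incoming messages."""
    b = phi * np.prod(incoming, axis=0)
    return b / b.sum()


k = 3
phi = np.array([0.4, 0.3, 0.3])              # hypothetical prior of one node
incoming = np.array([[0.52, 0.24, 0.24],     # hypothetical messages from
                     [0.10, 0.10, 0.14]])    # two neighbors

# Rescale every message so that its entries sum to k (average entry 1),
# which is the normalization used in Equation 5.3.
scaled = incoming * k / incoming.sum(axis=1, keepdims=True)

b_raw = belief(phi, incoming)
b_scaled = belief(phi, scaled)               # same beliefs as b_raw
```

The per-message scaling constants cancel in the final normalization, which is why the rescaled messages give the same beliefs.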


Table 5.1: LinBP: Major symbols and descriptions.

Symbol                    Description
$n$                       number of nodes
$A$                       $n \times n$ weighted symmetric adjacency matrix
$D$                       $n \times n$ diagonal degree matrix
$I_k$                     $k \times k$ identity matrix
$s, t, u$                 indices used for nodes
$N(s)$                    set of neighbors for node $s$
$k$                       number of classes
$i, j, g$                 indices used for classes
$\epsilon_H$              scaling factor
$\phi_s$                  $k \times 1$ prior (explicit, initial) belief vector at node $s$
$b_s$                     $k \times 1$ posterior (implicit, final) belief vector at node $s$
$m_{st}$                  $k \times 1$ message vector from node $s$ to node $t$
$\Phi, B$                 $n \times k$ explicit or implicit belief matrix with $\Phi(s,i)$ indicating the belief in class $i$ by node $s$
$H$                       $k \times k$ coupling matrix with $H(j,i)$ indicating the influence of class $j$ of a sender on class $i$ of the recipient
$H_h, \Phi_h, B_h$        residual matrices centered around $1/k$
$H_{h|o}$                 unscaled, original coupling matrix: $H_h = \epsilon_H H_{h|o}$
$\mathrm{vec}(X)$         vectorization of matrix $X$
$X \otimes Y$             Kronecker product between matrices $X$ and $Y$
$\rho(X)$                 spectral radius of a matrix $X$
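Table 5.2: Logarithm and division approximations used in our derivation.

             Formula                                        Maclaurin series                                                                    Approx.
Logarithm    $\ln(1 + \epsilon)$                            $= \epsilon - \frac{\epsilon^2}{2} + \frac{\epsilon^3}{3} - \ldots$               $\approx \epsilon$
Division     $\frac{\frac{1}{k} + \epsilon_1}{1 + \epsilon_2}$    $= (\frac{1}{k} + \epsilon_1)(1 - \epsilon_2 + \epsilon_2^2 - \ldots)$      $\approx \frac{1}{k} + \epsilon_1 - \frac{\epsilon_2}{k}$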
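As a quick sanity check of the two approximations in Table 5.2, the following minimal sketch compares each first-order form with its exact value. The function names and the sample values of $\epsilon$ are illustrative assumptions, not part of the original derivation.

```python
import math


def log_approx_error(eps: float) -> float:
    """Absolute error of the approximation ln(1 + eps) ~ eps."""
    return abs(math.log1p(eps) - eps)


def division_approx_error(k: int, eps1: float, eps2: float) -> float:
    """Absolute error of (1/k + eps1) / (1 + eps2) ~ 1/k + eps1 - eps2/k."""
    exact = (1.0 / k + eps1) / (1.0 + eps2)
    approx = 1.0 / k + eps1 - eps2 / k
    return abs(exact - approx)


# Illustrative residual sizes; both errors shrink roughly quadratically.
for eps in (0.1, 0.01, 0.001):
    print(eps, log_approx_error(eps), division_approx_error(3, eps, eps))
```

Both errors are of second order in the residuals, which is why the linearization is only expected to be accurate when all residuals are small.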

Theorem  1  (Linearized  BP  (LinBP)). 


Let $B_h$ and $\Phi_h$ be the residual matrices of final and explicit beliefs centered around $1/k$, $H_h$ the residual coupling matrix centered around $1/k$, $A$ the adjacency matrix, and $D = \mathrm{diag}(d)$ the diagonal degree matrix. Then, the final belief assignment from belief propagation is approximated by the equation system:

$$B_h = \Phi_h + A B_h H_h - D B_h H_h^2 \qquad \text{(LinBP)} \tag{5.4}$$

Proof.  We  give  the  proof  in  Section  5.3.1.  ■ 

Figure 5.2 illustrates Equation 5.4 and shows our matrix conventions. We refer to the term $D B_h H_h^2$ as "echo cancellation". This effect exists in the original BP update equations: The message sent across an edge excludes the information that was received in the previous iteration through the same edge, but in the opposite direction ("$u \in N(s) \setminus t$" in Equation 5.2). In a probabilistic scenario on tree-based graphs, this term is required for correctness. In loopy graphs (without well-justified semantics), this term still compensates for two neighboring nodes building up each other's scores. For increasingly small residuals, the echo cancellation becomes increasingly negligible, and by further ignoring it, Equation 5.4 can be simplified to

$$B_h = \Phi_h + A B_h H_h \qquad \text{(LinBP*)} \tag{5.5}$$

We will refer to Equation 5.4 (with echo cancellation) as LinBP and Equation 5.5 (without echo cancellation) as LinBP*.


[Figure 5.2: schematic of the matrix equation $B_h = \Phi_h + A B_h H_h - D B_h H_h^2$.]

Figure 5.2: Matrix conventions: $H(j,i)$ indicates the relative influence of class $j$ of a node on class $i$ of its neighbor, $A$ is the adjacency matrix, and $B(s,i)$ is the belief in class $i$ by node $s$.

Iterative updates. We note that these equations give an implicit definition of the final beliefs after convergence, and eventually a closed formula (Section 5.3.2). One of the many ways to solve Equations 5.4 and 5.5 is by iteration: Starting with an arbitrary initialization of $B_h$ (e.g., all values zero), we repeatedly compute the right-hand side of the equations and update the values of $B_h$ until the process converges:

$$B_h^{(\ell+1)} \leftarrow \Phi_h + A B_h^{(\ell)} H_h - D B_h^{(\ell)} H_h^2 \qquad \text{(LinBP)} \tag{5.6}$$

$$B_h^{(\ell+1)} \leftarrow \Phi_h + A B_h^{(\ell)} H_h \qquad \text{(LinBP*)} \tag{5.7}$$

Thus, the final beliefs of each node can be computed via elegant matrix operations and optimized solvers, while the implicit form gives us guarantees for the convergence of this process, as explained in Section 5.4.1. Also notice that our update equations calculate beliefs directly (i.e., without having to calculate the messages first); this will give us significant performance improvements as our experiments will later show.
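The iterative updates of Equations 5.6 and 5.7 can be sketched in a few lines. The following is a minimal, dense, pure-Python illustration and not the optimized sparse Java implementation used in the experiments; the function names, the toy path graph, the prior beliefs, and the scaling factor `EPS_H = 0.01` are assumptions chosen only for illustration (the coupling matrix is the one from Table 5.4).

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain dense matrix product (fine for tiny illustrative inputs)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def linbp(adj: Matrix, phi_h: Matrix, h_h: Matrix, echo: bool = True,
          max_iter: int = 1000, tol: float = 1e-12) -> Matrix:
    """Iterate Equation 5.6 (echo=True) or Equation 5.7 (echo=False)."""
    n, k = len(phi_h), len(h_h)
    degree = [sum(row) for row in adj]       # diagonal of D; unweighted graph assumed
    h_squared = matmul(h_h, h_h)
    b_h = [[0.0] * k for _ in range(n)]      # all-zero initialization
    for _ in range(max_iter):
        spread = matmul(matmul(adj, b_h), h_h)   # A B_h H_h
        echo_term = matmul(b_h, h_squared)       # B_h H_h^2, scaled by D below
        new_b = [[phi_h[s][i] + spread[s][i]
                  - (degree[s] * echo_term[s][i] if echo else 0.0)
                  for i in range(k)] for s in range(n)]
        change = max(abs(new_b[s][i] - b_h[s][i])
                     for s in range(n) for i in range(k))
        b_h = new_b
        if change < tol:
            break
    return b_h


# Hypothetical toy input: path graph 0-1-2-3 with a single labeled node.
ADJ = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
EPS_H = 0.01                                       # illustrative scaling factor
H_O = [[10, -4, -6], [-4, 7, -3], [-6, -3, 9]]     # unscaled matrix of Table 5.4
H_H = [[EPS_H * v for v in row] for row in H_O]
PHI_H = [[0.1, -0.05, -0.05], [0.0] * 3, [0.0] * 3, [0.0] * 3]

beliefs = linbp(ADJ, PHI_H, H_H)
beliefs_star = linbp(ADJ, PHI_H, H_H, echo=False)
top_classes = [row.index(max(row)) for row in beliefs]
print(top_classes)
```

Because every row of the residual coupling matrix sums to zero, each row of the returned belief matrix also sums to (numerically) zero, so the residuals stay centered throughout the iteration.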


5.3  Derivation  of  LinBP 

This  section  sketches  the  proofs  of  our  first  technical  contribution:  Section  5.3.1  linearizes  the 
update  equations  of  BP  by  centering  around  appropriate  defaults  and  using  the  approximations 
from  Table  5.2  (Lemma  1),  and  then  expressing  the  steady  state  messages  in  terms  of  beliefs 
(Lemma  2).  Section  5.3.2  gives  an  additional  closed-form  expression  for  the  steady-state  beliefs 
(Lemma  3). 


5.3.1 Centering Belief Propagation

We derive our formalism by centering the elements of the coupling matrix and all message and belief vectors around their natural default values; i.e., the elements of $m$ around 1, and the elements of $H$, $\phi$, and $b$ around $1/k$. We are interested in the residual values given by: $m(i) = 1 + m_h(i)$, $H(j,i) = \frac{1}{k} + H_h(j,i)$, $\phi(i) = \frac{1}{k} + \phi_h(i)$, and $b(i) = \frac{1}{k} + b_h(i)$ (footnote 4). As a consequence, $H_h \in \mathbb{R}^{k \times k}$ is the residual coupling matrix that makes explicit the relative attraction and repulsion: the sign of $H_h(j,i)$ tells us if the class $j$ attracts or repels class $i$ in a neighbor, and the magnitude of $H_h(j,i)$ indicates the strength. Subsequently, this centering allows us to rewrite belief propagation in terms of the residuals.


4 We call these default values "natural" as our results imply that if we start with centered messages around 1 and set $\frac{1}{Z_{st}} = k$, then the derived messages with Equation 5.3 remain centered around 1 for any iteration. Also we note that multiplying with a message vector with all entries 1 does not change anything. Similarly, a prior belief vector with all entries $1/k$ gives equal weight to each class. Finally, we call "nodes with explicit beliefs" those nodes for which the residuals have non-zero elements ($\phi_h \neq 0_k$), that is, the explicit beliefs deviate from the center $1/k$.
Lemma 1 (Centered BP).

By centering the coupling matrix, beliefs and messages, the update equations for belief propagation can be approximated by:

$$b_{h|s}(i) \leftarrow \phi_{h|s}(i) + \frac{1}{k} \sum_{u \in N(s)} m_{h|us}(i) \tag{5.8}$$

$$m_{h|st}(i) \leftarrow k \sum_j H_h(j,i)\, b_{h|s}(j) - \sum_j H_h(j,i)\, m_{h|ts}(j) \tag{5.9}$$


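Proof. Substituting the expansions into the belief updates Equation 5.1 leads to

$$\frac{1}{k} + b_{h|s}(i) \leftarrow \frac{1}{Z_s} \left(\frac{1}{k} + \phi_{h|s}(i)\right) \prod_{u \in N(s)} \left(1 + m_{h|us}(i)\right)$$

$$\ln\left(1 + k\, b_{h|s}(i)\right) \leftarrow -\ln Z_s + \ln\left(1 + k\, \phi_{h|s}(i)\right) + \sum_{u \in N(s)} \ln\left(1 + m_{h|us}(i)\right)$$

We then use the approximation $\ln(1 + \epsilon) \approx \epsilon$ for small $\epsilon$:

$$k\, b_{h|s}(i) \leftarrow -\ln Z_s + k\, \phi_{h|s}(i) + \sum_{u \in N(s)} m_{h|us}(i) \tag{5.10}$$

Summing both sides over $i$ gives us:

$$k \underbrace{\sum_i b_{h|s}(i)}_{=0} \leftarrow -k \ln Z_s + k \underbrace{\sum_i \phi_{h|s}(i)}_{=0} + \underbrace{\sum_i \sum_{u \in N(s)} m_{h|us}(i)}_{=0}$$

Hence, $\ln Z_s$ needs to be 0, and therefore our normalizer is actually a constant that is the same for all nodes: $Z_s = 1$. Plugging $Z_s = 1$ back into Equation 5.10 leads to Equation 5.8:

$$b_{h|s}(i) \leftarrow \phi_{h|s}(i) + \frac{1}{k} \sum_{u \in N(s)} m_{h|us}(i)$$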


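To obtain Equation 5.9, we first write Equation 5.3 as:

$$m_{st}(i) \leftarrow \frac{1}{Z_{st}} \sum_j H(j,i)\, \phi_s(j) \prod_{u \in N(s) \setminus t} m_{us}(j) \tag{5.11}$$

$$\leftarrow \frac{Z_s}{Z_{st}} \sum_j H(j,i)\, \frac{\frac{1}{Z_s} \phi_s(j) \prod_{u \in N(s)} m_{us}(j)}{m_{ts}(j)} \tag{5.12}$$

$$\leftarrow \frac{Z_s}{Z_{st}} \sum_j H(j,i)\, \frac{b_s(j)}{m_{ts}(j)} \tag{5.13}$$

Then, using $Z_s = 1$ and the expansions leads to:

$$1 + m_{h|st}(i) \leftarrow \frac{1}{Z_{st}} \sum_j \left(\frac{1}{k} + H_h(j,i)\right) \frac{\frac{1}{k} + b_{h|s}(j)}{1 + m_{h|ts}(j)}$$

For the right side, we then use the approximation $\frac{\frac{1}{k} + \epsilon_1}{1 + \epsilon_2} \approx \frac{1}{k} + \epsilon_1 - \frac{\epsilon_2}{k}$ for small $\epsilon_1, \epsilon_2$:

$$1 + m_{h|st}(i) \leftarrow \frac{1}{Z_{st}} \sum_j \left(\frac{1}{k} + H_h(j,i)\right) \left(\frac{1}{k} + b_{h|s}(j) - \frac{1}{k} m_{h|ts}(j)\right) \tag{5.14}$$

$$= \frac{1}{Z_{st}} \Big( \frac{1}{k} + \frac{1}{k} \underbrace{\sum_j H_h(j,i)}_{=0} + \frac{1}{k} \underbrace{\sum_j b_{h|s}(j)}_{=0} + \sum_j H_h(j,i)\, b_{h|s}(j) - \frac{1}{k^2} \underbrace{\sum_j m_{h|ts}(j)}_{=0} - \frac{1}{k} \sum_j H_h(j,i)\, m_{h|ts}(j) \Big) \tag{5.15}$$

$$= \frac{1}{Z_{st}} \Big( \frac{1}{k} + \sum_j H_h(j,i)\, b_{h|s}(j) - \frac{1}{k} \sum_j H_h(j,i)\, m_{h|ts}(j) \Big) \tag{5.16}$$

We can then determine the normalization factor $Z_{st}$ to be a constant as well ($Z_{st} = k^{-1}$) by summing both sides of Equation 5.16 over $i$ and observing that $\sum_j b_{h|s}(j) \sum_i H_h(j,i) = 0$, since $\sum_i H_h(j,i) = 0$:

$$k + \underbrace{\sum_i m_{h|st}(i)}_{=0} = \frac{1}{Z_{st}} \Big( 1 + \underbrace{\sum_i \sum_j H_h(j,i)\, b_{h|s}(j)}_{=0} - \frac{1}{k} \underbrace{\sum_i \sum_j H_h(j,i)\, m_{h|ts}(j)}_{=0} \Big)$$

We get Equation 5.9 from Equation 5.16 and $\frac{1}{Z_{st}} = k$:

$$m_{h|st}(i) \leftarrow k \sum_j H_h(j,i)\, b_{h|s}(j) - \sum_j H_h(j,i)\, m_{h|ts}(j)$$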

Using  Lemma  1,  we  can  derive  a  closed-form  description  for  the  steady-state  of  BP. 
Lemma 2 (Steady state messages).

For small deviations of all values from their expected values, and after the convergence of belief propagation, message propagation can be expressed in terms of the steady beliefs as:

$$m_{h|st} = k\, (I_k - H_h^2)^{-1} H_h\, (b_{h|s} - H_h b_{h|t}) \tag{5.17}$$

where $I_k$ is the $k \times k$ identity matrix.


Proof. Using Equation 5.9 and plugging the expression for $m_{h|ts}(j)$ back into the equation for $m_{h|st}(i)$, we get:

$$m_{h|st}(i) \leftarrow k \sum_j H_h(j,i)\, b_{h|s}(j) - \sum_j H_h(j,i) \cdot \Big( k \sum_g H_h(g,j)\, b_{h|t}(g) - \sum_g H_h(g,j)\, m_{h|st}(g) \Big)$$

Now, for the case of convergence, $m_{h|st}$ on the left and right side of the equation needs to be the same. We can, therefore, group related terms together and replace the update symbol with equality:

$$m_{h|st}(i) - \sum_j H_h(j,i) \sum_g H_h(g,j)\, m_{h|st}(g) = k \sum_j H_h(j,i)\, b_{h|s}(j) - k \sum_j H_h(j,i) \sum_g H_h(g,j)\, b_{h|t}(g) \tag{5.18}$$

This equation can then be written in matrix notation as:

$$(I_k - H_h^2)\, m_{h|st} = k H_h b_{h|s} - k H_h^2 b_{h|t} \tag{5.19}$$

which leads to Equation 5.17, given that all entries of $H_h \ll \frac{1}{k}$ and thus the inverse of $(I_k - H_h^2)$ always exists. ■


From  Lemma  2,  we  can  now  prove  Theorem  1. 


Proof. [Theorem 1 (Linearized BP / LinBP)] For steady-state, we can write Equation 5.8 in vector form:

$$b_{h|s} = \phi_{h|s} + \frac{1}{k} \sum_{u \in N(s)} m_{h|us}$$

and by substituting $H_h^*$ for $(I_k - H_h^2)^{-1} H_h$, we write Equation 5.17 as

$$m_{h|us} = k H_h^* (b_{h|u} - H_h b_{h|s})$$

Combining the last two equations, we get

$$b_{h|s} = \phi_{h|s} + H_h^* \sum_{u \in N(s)} b_{h|u} - d_s H_h^* H_h b_{h|s} \tag{5.20}$$

where $d_s$ is the degree or number of bi-directional neighbors for node $s$, i.e., neighbors that are connected to $s$ with edges in both directions (see Section 5.4.2 for a discussion of the implication). By using $B_h$ and $\Phi_h$ as $n \times k$ matrices of final and initial beliefs, $D$ as the diagonal degree matrix, and $A$ as the adjacency matrix, Equation 5.20 can be written in matrix form

$$B_h = \Phi_h + A B_h H_h^* - D B_h H_h H_h^* \tag{5.21}$$

By approximating $(I_k - H_h^2) \approx I_k$ (recall that all entries of $H_h \ll \frac{1}{k}$) and thus $H_h^* \approx H_h$, we can simplify to Equation 5.4:

$$B_h = \Phi_h + A B_h H_h - D B_h H_h^2$$

And by further ignoring the last term, which contains residual terms of third order, we can further simplify to get Equation 5.5. ■


5.3.2  Closed-form  solution  for  LinBP 

In practice, we will solve Equation 5.4 and Equation 5.5 via an iterative computation (see end of Section 5.2). However, we next give a closed-form solution, which will later allow us to study the convergence of the iterative updates. We need to introduce two new notions: Let $X$ and $Y$ be matrices of order $m \times n$ and $p \times q$, respectively, and let $x_j$ denote the $j$-th column of matrix $X$; i.e., $X = \{x_{ij}\} = [x_1 \ldots x_n]$. First, the vectorization of matrix $X$ stacks the columns of a matrix one underneath the other to form a single column vector; i.e.,

$$\mathrm{vec}(X) = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}$$

Second, the Kronecker product of $X$ and $Y$ is the $mp \times nq$ matrix defined by

$$X \otimes Y = \begin{bmatrix} x_{11} Y & x_{12} Y & \cdots & x_{1n} Y \\ x_{21} Y & x_{22} Y & \cdots & x_{2n} Y \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} Y & x_{m2} Y & \cdots & x_{mn} Y \end{bmatrix}$$

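Lemma 3 (Closed-form LinBP).

The closed-form solution for LinBP (Equation 5.4) is given by:

$$\mathrm{vec}(B_h) = (I_{nk} - H_h \otimes A + H_h^2 \otimes D)^{-1}\, \mathrm{vec}(\Phi_h) \qquad \text{(LinBP)} \tag{5.22}$$

Proof. Roth's column lemma [HS81, Rot34] states that

$$\mathrm{vec}(XYZ) = (Z^T \otimes X)\, \mathrm{vec}(Y)$$

With $H_h^T = H_h$, we can thus write Equation 5.4 as

$$\mathrm{vec}(B_h) = \mathrm{vec}(\Phi_h) + (H_h \otimes A)\, \mathrm{vec}(B_h) - (H_h^2 \otimes D)\, \mathrm{vec}(B_h)$$

$$= \mathrm{vec}(\Phi_h) + (H_h \otimes A - H_h^2 \otimes D)\, \mathrm{vec}(B_h) \tag{5.23}$$

which can be solved for $\mathrm{vec}(B_h)$ to get Equation 5.22. ■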

By further ignoring the echo cancellation $H_h^2 \otimes D$, we get the closed-form for LinBP* (Equation 5.5) as:

$$\mathrm{vec}(B_h) = (I_{nk} - H_h \otimes A)^{-1}\, \mathrm{vec}(\Phi_h) \qquad \text{(LinBP*)} \tag{5.24}$$

Thus,  by  using  Equation  5.22  or  Equation  5.24,  we  are  able  to  compute  the  final  beliefs  in  a 
closed-form,  as  long  as  the  inverse  of  the  matrix  exists.  In  the  next  section,  we  show  the  relation 
of  the  closed-form  to  our  original  update  equation  Equation  5.6  and  give  the  exact  convergence 
criteria. 
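The vectorization, the Kronecker product, and Roth's column lemma used in the proof of Lemma 3 can be checked on a small example. The sketch below is a pure-Python illustration; the helper names and the three $2 \times 2$ integer matrices are arbitrary assumptions chosen so that the identity can be verified exactly.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Dense matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def matvec(x: Matrix, v: List[float]) -> List[float]:
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in x]


def transpose(x: Matrix) -> Matrix:
    return [list(col) for col in zip(*x)]


def vec(x: Matrix) -> List[float]:
    """Stack the columns of x underneath each other."""
    return [x[i][j] for j in range(len(x[0])) for i in range(len(x))]


def kron(x: Matrix, y: Matrix) -> Matrix:
    """Kronecker product: block (i, j) of the result is x[i][j] * y."""
    p, q = len(y), len(y[0])
    return [[x[i // p][j // q] * y[i % p][j % q]
             for j in range(len(x[0]) * q)]
            for i in range(len(x) * p)]


# Arbitrary small integer matrices (illustrative only).
X = [[1, 2], [3, 4]]
Y = [[1, 0], [2, 1]]
Z = [[2, 0], [1, 3]]

lhs = vec(matmul(matmul(X, Y), Z))          # vec(XYZ)
rhs = matvec(kron(transpose(Z), X), vec(Y))  # (Z^T kron X) vec(Y)
print(lhs == rhs)
```

Applying the same identity with $X = A$, $Y = B_h$, $Z = H_h$ (and with $X = D$, $Z = H_h^2$) is what turns the matrix equation into the linear system of Equation 5.23.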


5.4  Additional  Benefits  of  LinBP 

In this section, we give the sufficient and necessary convergence criteria for LinBP and LinBP*, and show how our formalism generalizes to weighted graphs.

5.4.1 Update equations and Convergence

Equation 5.22 and Equation 5.24 give us a closed-form for the final beliefs after convergence. From the Jacobi method for solving linear systems [Saa03], we know that the solution for $y = (I - M)^{-1} x$ can be calculated, under certain conditions, via the iterative update equation

$$y^{(\ell+1)} \leftarrow x + M y^{(\ell)} \tag{5.25}$$

These updates are known to converge for any choice of initial values for $y^{(0)}$, as long as $M$ has a spectral radius $\rho(M) < 1$ (footnote 5). Thus, the same convergence guarantees carry over when Equation 5.22 and Equation 5.24 are written, respectively, as

$$\mathrm{vec}(B_h^{(\ell+1)}) \leftarrow \mathrm{vec}(\Phi_h) + (H_h \otimes A - H_h^2 \otimes D)\, \mathrm{vec}(B_h^{(\ell)}) \tag{5.26}$$

$$\mathrm{vec}(B_h^{(\ell+1)}) \leftarrow \mathrm{vec}(\Phi_h) + (H_h \otimes A)\, \mathrm{vec}(B_h^{(\ell)}) \tag{5.27}$$

Furthermore, it follows from Lemma 3 that update Equation 5.26 is equivalent to our original update Equation 5.6, and thus both have the same convergence guarantees.

We  are  now  ready  to  give  the  sufficient  and  necessary  criteria  for  the  convergence  of  the 
iterative  LinBP  and  LinBP*  update  equations. 


Lemma 4 (Exact convergence).

Necessary and sufficient criteria for the convergence of LinBP and LinBP* are:

$$\text{LinBP converges} \iff \rho(H_h \otimes A - H_h^2 \otimes D) < 1 \tag{5.28}$$

$$\text{LinBP* converges} \iff \rho(H_h) < \frac{1}{\rho(A)} \tag{5.29}$$

Proof. From the Jacobi method for solving linear systems [Saa03], we know that the update equation Equation 5.26 converges if and only if the spectral radius of the matrix is smaller than 1. Thus, the criterion follows immediately for LinBP (Equation 5.22).

For Equation 5.24, we have $M = H_h \otimes A$ and therefore $\rho(H_h \otimes A) = \rho(H_h)\, \rho(A) < 1$, which holds if and only if $\rho(H_h) < \frac{1}{\rho(A)}$. This concludes the proof for LinBP*. ■

We note that Equation 5.28 is an implicit criterion for $H_h$. We can give an alternative explicit sufficient (but not necessary) criterion as follows: we have $M = H_h \otimes A - H_h^2 \otimes D$ and therefore $\rho(H_h \otimes A - H_h^2 \otimes D) \leq \rho(H_h \otimes A) + \rho(H_h^2 \otimes D) = \rho(H_h)\, \rho(A) + \rho(H_h^2)\, \rho(D) \leq \rho(H_h)\, \rho(A) + \rho(H_h)^2 \rho(D) < 1$, which holds if $\rho(H_h) < \frac{\sqrt{\rho(A)^2 + 4 \rho(D)} - \rho(A)}{2 \rho(D)}$.

In  practice,  computation  of  the  largest  eigenvalues  can  be  expensive.  Instead,  we  can  exploit 
the  fact  that  any  norm  ||X||  gives  an  upper  bound  to  the  spectral  radius  of  a  matrix  X  to  establish 
sufficient  (but  not  necessary)  and  easier-to-compute  conditions  for  convergence. 

5 The spectral radius $\rho(\cdot)$ is the supremum among the absolute values of the eigenvalues of the enclosed matrix.
Lemma 5 (Sufficient convergence).

Let $\|\cdot\|$ stand for any sub-multiplicative norm of the enclosed matrix. Then, the following are sufficient criteria for convergence:

$$\text{LinBP converges} \Leftarrow \|H_h\| < \frac{\sqrt{\|A\|^2 + 4 \|D\|} - \|A\|}{2 \|D\|} \tag{5.30}$$

$$\text{LinBP* converges} \Leftarrow \|H_h\| < \frac{1}{\|A\|} \tag{5.31}$$

Further, let $\mathcal{M}$ be a set of such norms and let $\|X\|_{\mathcal{M}} := \min_{\|\cdot\|_i \in \mathcal{M}} \|X\|_i$. Then, by replacing each $\|\cdot\|$ with $\|\cdot\|_{\mathcal{M}}$, we get better bounds.

Proof. Since $\rho(X) \leq \|X\|$, it is sufficient to show that $\|X\| < 1$. For LinBP* (Equation 5.24), we have $\rho(H_h \otimes A) = \rho(H_h)\, \rho(A) \leq \|H_h\|_i \|A\|_j < 1$, which holds if $\|H_h\|_i < \frac{1}{\|A\|_j}$. We note that different norms $\|\cdot\|_i$ and $\|\cdot\|_j$ can be used, and the best bounds can be achieved by minimizing each norm individually.

For LinBP (Equation 5.22), we have $\rho(H_h \otimes A - H_h^2 \otimes D) \leq \rho(H_h)\, \rho(A) + \rho(H_h)^2 \rho(D) \leq \|H_h\|_i \|A\|_j + \|H_h\|_i^2 \|D\|_k < 1$, which holds if $\|H_h\|_i < \frac{\sqrt{\|A\|_j^2 + 4 \|D\|_k} - \|A\|_j}{2 \|D\|_k}$. Just as before, we can use different norms $\|\cdot\|_i$, $\|\cdot\|_j$, and $\|\cdot\|_k$, and we get the best bounds by minimizing each norm individually. ■


Vector/elementwise p-norms for $p \in [1, 2]$ (e.g., the Frobenius norm) and all induced p-norms are sub-multiplicative (footnote 6). Furthermore, vector p-norms are monotonically decreasing for increasing $p$, and thus: $\rho(X) \leq \|X\|_2 \leq \|X\|_1$. We thus suggest using the following set $\mathcal{M}$ of three norms which are all fast to calculate: (i) Frobenius norm, (ii) induced 1-norm, and (iii) induced $\infty$-norm.
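The sufficient criteria of Lemma 5 with the suggested set of three norms can be sketched as follows. This is a hedged, pure-Python illustration: the function names, the toy path graph, and the idea of turning the bound on $\|H_h\|$ into a bound on the scaling factor $\epsilon_H$ (via $H_h = \epsilon_H H_{h|o}$) are assumptions made for the example.

```python
import math
from typing import List

Matrix = List[List[float]]


def frobenius_norm(x: Matrix) -> float:
    return math.sqrt(sum(v * v for row in x for v in row))


def induced_1_norm(x: Matrix) -> float:
    """Maximum absolute column sum."""
    return max(sum(abs(row[j]) for row in x) for j in range(len(x[0])))


def induced_inf_norm(x: Matrix) -> float:
    """Maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in x)


def min_norm(x: Matrix) -> float:
    """Smallest of the three suggested norms (the set M in the text)."""
    return min(frobenius_norm(x), induced_1_norm(x), induced_inf_norm(x))


def linbp_star_bound(adj: Matrix) -> float:
    """Right-hand side of Equation 5.31."""
    return 1.0 / min_norm(adj)


def linbp_bound(adj: Matrix, deg: Matrix) -> float:
    """Right-hand side of Equation 5.30."""
    a, d = min_norm(adj), min_norm(deg)
    return (math.sqrt(a * a + 4.0 * d) - a) / (2.0 * d)


def max_scaling(h_o: Matrix, bound: float) -> float:
    """Largest eps_H with eps_H * ||H_o|| below the given bound."""
    return bound / min_norm(h_o)


# Hypothetical toy input: path graph 0-1-2-3 and the matrix of Table 5.4.
ADJ = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
DEG = [[1, 0, 0, 0], [0, 2, 0, 0], [0, 0, 2, 0], [0, 0, 0, 1]]
H_O = [[10, -4, -6], [-4, 7, -3], [-6, -3, 9]]

eps_linbp = max_scaling(H_O, linbp_bound(ADJ, DEG))
eps_linbp_star = max_scaling(H_O, linbp_star_bound(ADJ))
print(eps_linbp, eps_linbp_star)
```

Only row sums, column sums, and sums of squares are needed, so these bounds avoid any eigenvalue computation.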

We  also  give  an  additional,  simpler,  yet  less  tight  sufficient  condition  for  the  convergence  of 
LinBP. 


Lemma 6 (Alternative norm criterion).

Let $\|\cdot\|$ stand for the induced 1-norm or induced $\infty$-norm of the enclosed matrix. Then LinBP converges if $\|H_h\| < \frac{1}{2 \|A\|}$.

Proof. For the induced 1-norm or $\infty$-norm (which are the maximum absolute column and row sum of a matrix, respectively), we know from the definition of $D$ that $\|D\| \leq \|A\|$. With $\|H_h\|^2 < \|H_h\| < 1$, we thus have $\|H_h\| \|A\| + \|H_h\|^2 \|D\| < 2 \|H_h\| \|A\| < 1$, from which $\|H_h\| < \frac{1}{2 \|A\|}$. ■

6 Vector p-norms are defined as $\|X\|_p = \left(\sum_i \sum_j |X(i,j)|^p\right)^{1/p}$. Induced p-norms, for $p = 1$ and $p = \infty$, are defined as $\|X\|_1 = \max_j \sum_i |X(i,j)|$ and $\|X\|_\infty = \max_i \sum_j |X(i,j)|$, i.e., as maximum absolute column sum or maximum absolute row sum, respectively.

5.4.2 Weighted graphs


Notice that Theorem 1 can be generalized to allow weighted graphs by simply using a weighted adjacency matrix $A$ with elements $A(i,j) = w > 0$ if the edge $(j,i)$ exists with weight $w$, and $A(i,j) = 0$ otherwise. Our derivation remains the same; we only need to make sure that the degree $d_s$ of a node $s$ is the sum of the squared weights to its neighbors (recall that the echo cancellation goes back and forth). The weight on an edge simply scales the coupling strengths between two neighbors, and we have to add up parallel paths. Thus, Theorem 1 (LinBP) can be applied for weighted graphs as well.
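The degree convention for weighted graphs can be made concrete with a short sketch. The function name and the example weights below are illustrative assumptions; the only rule taken from the text is that $d_s$ is the sum of the squared weights of the edges incident to $s$.

```python
from typing import List

Matrix = List[List[float]]


def echo_cancellation_degrees(adj: Matrix) -> List[float]:
    """Degree d_s used in the echo-cancellation term: sum of squared edge weights."""
    return [sum(w * w for w in row) for row in adj]


# Hypothetical weighted graph: edge 0-1 with weight 2, edge 1-2 with weight 0.5.
WEIGHTED = [[0, 2, 0], [2, 0, 0.5], [0, 0.5, 0]]
# Unweighted star graph: the rule reduces to the ordinary node degree.
UNWEIGHTED = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]

print(echo_cancellation_degrees(WEIGHTED))
print(echo_cancellation_degrees(UNWEIGHTED))
```

For 0/1 adjacency matrices the squared weights equal the weights, so the unweighted formulation of Theorem 1 is recovered as a special case.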


5.5  Equivalence  to  FaBP  (k  =  2) 


Chapter  4  presented  a  linearization  for  belief  propagation  for  the  binary  case  (k  =  2) .  Here  we 
show  that  the  more  general  results  we  presented  in  this  chapter  include  the  binary  case  as  a  special 
case. 


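We start from Equation 5.19 and use the normalization conditions for $k = 2$ to write

$$\mathbf{b}_h = \begin{bmatrix} b_h \\ -b_h \end{bmatrix}, \quad \mathbf{m}_h = \begin{bmatrix} m_h \\ -m_h \end{bmatrix}, \quad \text{and} \quad H_h = \begin{bmatrix} h_h & -h_h \\ -h_h & h_h \end{bmatrix}$$

We then get $H_h^2 = 2 \begin{bmatrix} h_h^2 & -h_h^2 \\ -h_h^2 & h_h^2 \end{bmatrix}$. As the normalization $x(1) = -x(2)$ holds for all results, it suffices to only focus on one dimension, which we choose without loss of generality to be the first. We get: $(k H_h b_{h|s})(1) = 4 b_{h|s} h_h$, $(k H_h^2 b_{h|t})(1) = 8 b_{h|t} h_h^2$, $((I_k - H_h^2)\, m_{h|st})(1) = (1 - 4 h_h^2)\, m_{h|st}$, and finally:

$$m_{h|st} = \frac{4 h_h}{1 - 4 h_h^2}\, b_{h|s} - \frac{8 h_h^2}{1 - 4 h_h^2}\, b_{h|t} \tag{5.32}$$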


We note that Equation 5.32 differs from Equation 4.19 in Chapter 4 by a factor of 2 on the right side. This is the result of the decision to center the messages around $\frac{1}{2}$ in Chapter 4, whereas here we center them around 1: Centering messages around 1 allowed us to ignore incoming messages that have no residuals (i.e., $m_{st} = 1$) and was a crucial assumption in our derivations (see Section 5.3.1). This difference leads to a factor 2 in Equation 5.9 for $k = 2$, and thus a factor 2 difference in Equation 5.32. However, both alternative centering approaches ultimately lead to the same equation in the binary case:

$$\mathbf{b} = \left( I_n - \frac{2 h_h}{1 - 4 h_h^2}\, A + \frac{4 h_h^2}{1 - 4 h_h^2}\, D \right)^{-1} \boldsymbol{\phi}$$

where $\mathbf{b}$ and $\boldsymbol{\phi}$ are the column vectors that contain the first dimension of the binary centered beliefs for each node. This can be easily seen by writing the original, non-simplified version Equation 5.21 in the vectorized form of Equation 5.22.
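The reduction to the binary case can be checked numerically. The sketch below evaluates the general steady-state message of Equation 5.17 for $k = 2$ and compares its first component with the scalar form of Equation 5.32; the function names and the sample values of $h_h$, $b_{h|s}$, and $b_{h|t}$ are illustrative assumptions.

```python
from typing import List


def matvec(m: List[List[float]], v: List[float]) -> List[float]:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def steady_state_message(h: float, b_s: float, b_t: float) -> List[float]:
    """Equation 5.17 for k = 2: k (I - H^2)^{-1} H (b_s - H b_t)."""
    k = 2
    H = [[h, -h], [-h, h]]
    H2 = [[sum(H[i][g] * H[g][j] for g in range(k)) for j in range(k)]
          for i in range(k)]
    m = [[(1.0 if i == j else 0.0) - H2[i][j] for j in range(k)] for i in range(k)]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]          # equals 1 - 4 h^2
    inv = [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]
    hb_t = matvec(H, [b_t, -b_t])
    inner = [b_s - hb_t[0], -b_s - hb_t[1]]              # b_s - H b_t
    return [k * v for v in matvec(inv, matvec(H, inner))]


def binary_message(h: float, b_s: float, b_t: float) -> float:
    """Scalar form of Equation 5.32."""
    return 4 * h / (1 - 4 * h * h) * b_s - 8 * h * h / (1 - 4 * h * h) * b_t


# Illustrative residual values.
general = steady_state_message(0.1, 0.05, -0.02)
scalar = binary_message(0.1, 0.05, -0.02)
print(general, scalar)
```

The second component of the general message is the negative of the first, which is the normalization that justifies tracking a single dimension.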
5.6 Experiments

In  this  section,  we  experimentally  verify  how  well  our  new  method  LinBP  scales,  and  how  close  its 
top  belief  classification  matches  that  of  standard  BP. 

5.6.1  Experimental  setup 

We implemented main memory-based versions of BP and LinBP in Java. The implementation uses optimized libraries for sparse matrix operations [ ]. When timing our memory-based algorithms, we focus on the running times for computations only and ignore the time for loading data and initializing matrices. We are mainly interested in relative performance (LinBP vs. BP) and scalability with graph sizes. Both implementations run on a 2.5 GHz Intel Core i5 with 16 GB of main memory and a 1 TB SSD. To allow comparability across implementations, we limit evaluation to one processor. For timing results, we run BP and LinBP for 5 iterations.

5.6.2  Data 

Synthetic data. We assume a scenario with $k = 3$ classes and the matrix $H_{h|o}$ from Table 5.4 as the unscaled coupling matrix. We study the convergence of our algorithms by scaling $H_{h|o}$ with a varying parameter $\epsilon_H$. We created nine Kronecker graphs of varying sizes (see Table 5.3) which are known to share many properties with real-world graphs [ ,CKF05] (footnote 7). To generate initial class labels (explicit beliefs), we pick 5% of the nodes in each graph and assign two random numbers from $\{-0.1, -0.09, \ldots, 0.09, 0.1\}$ to them as centered beliefs for two classes (the belief in the third class is then their negative sum due to centering).

DBLP data. As in Chapter 4, we use the DBLP dataset from [JSD+10], which consists of 36 138 nodes representing papers, authors, conferences, and terms. Each paper is connected to its authors, the conference in which it appeared, and the terms in its title. Overall, the graph contains 341 564 edges (counting edges twice according to their direction). Only 3 750 nodes (i.e., $\approx$ 10.4%) are labeled explicitly with one of 4 classes: AI (Artificial Intelligence), DB (Databases), DM (Data Mining), and IR (Information Retrieval). We are assuming homophily, which is represented by the $4 \times 4$ matrix in Figure 5.4. Our goal is to label the remaining 89.6% of the nodes.

5.6.3  Experimental  Results 

Our experimental approach is justified since BP has previously been shown to work well in real-life classification scenarios. Our goal in this work is not to justify BP for such inference, but rather to replace BP with a faster and simpler semantics that gives similar classifications. Therefore, we take the top beliefs returned by BP as "ground truth" (GT) and are interested in how close the classifications returned by LinBP come for varying values of $H_{h|o}$. We measure the quality of our methods with precision and recall as follows: Given a set of top beliefs $B_{GT}$ for a GT labeling method and a set of top beliefs $B_O$ of another method (O), let $B_\cap$ be the set of shared beliefs: $B_\cap = B_{GT} \cap B_O$. Then, recall $r$ measures the portion of GT beliefs that are returned by O: $r = |B_\cap| / |B_{GT}|$, and precision $p$ measures the portion of "correct" beliefs among $B_O$: $p = |B_\cap| / |B_O|$. Notice that this method naturally handles ties. For example, assume that the GT assigns classes $c_1, c_2, c_3$ as top beliefs to 3 nodes $v_1, v_2, v_3$, respectively: $\{v_1 \to c_1, v_2 \to c_2, v_3 \to c_3\}$, whereas the comparison method assigns 4 beliefs: $\{v_1 \to \{c_1, c_2\}, v_2 \to c_2, v_3 \to c_2\}$. Then $r = 2/3$ and $p = 2/4$. As an alternative, we also use the F1-score, which is the harmonic mean of precision and recall: $h = \frac{2pr}{p + r}$.

7 We count the number of entries in $A$ as the number of edges; thus, each edge is counted twice, i.e., $(s,t)$ equals $s \to t$ plus $t \to s$.

Table 5.3: Synthetic Data: Kronecker graphs.

     Graph characteristics                      Explicit b.
#    Nodes n       Edges e         e/n          5%         1‰
1    243           1 024           4.2          12         1
2    729           4 096           5.6          36         1
3    2 187         16 384          7.6          110        3
4    6 561         65 536          10.0         328        7
5    19 683        262 144         13.3         984        20
6    59 049        1 048 576       17.8         2 952      60
7    177 147       4 194 304       23.7         8 857      178
8    531 441       16 777 216      31.6         26 572     532
9    1 594 323     67 108 864      42.6         79 716     1 595

Table 5.4: Unscaled residual coupling matrix $H_{h|o}$.

      1      2      3
1     10     -4     -6
2     -4     7      -3
3     -6     -3     9
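The precision, recall, and F1 definitions over sets of top beliefs, including the tie example from the text, can be sketched as follows. The function names and the dictionary-of-sets representation of a top-belief assignment are assumptions made for this illustration.

```python
from typing import Dict, Set, Tuple


def belief_pairs(assignment: Dict[str, Set[str]]) -> Set[Tuple[str, str]]:
    """Flatten {node: {classes}} into a set of (node, class) top beliefs."""
    return {(node, cls) for node, classes in assignment.items() for cls in classes}


def precision_recall_f1(gt: Dict[str, Set[str]],
                        other: Dict[str, Set[str]]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 of `other` with respect to the ground truth `gt`."""
    b_gt, b_o = belief_pairs(gt), belief_pairs(other)
    shared = b_gt & b_o
    recall = len(shared) / len(b_gt)
    precision = len(shared) / len(b_o)
    f1 = 2 * precision * recall / (precision + recall) if shared else 0.0
    return precision, recall, f1


# The tie example from the text: the comparison method returns two top
# beliefs for v1 and a wrong top belief for v3.
GT = {"v1": {"c1"}, "v2": {"c2"}, "v3": {"c3"}}
OTHER = {"v1": {"c1", "c2"}, "v2": {"c2"}, "v3": {"c2"}}

p, r, f1 = precision_recall_f1(GT, OTHER)
print(p, r, f1)
```

Treating each (node, class) pair as one belief is what lets tied top beliefs count toward recall while lowering precision.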

Next we answer two questions about our proposed method LinBP, one with respect to timing and one with respect to quality.

Question  1.  Timing:  How  fast  and  scalable  is  LinBP  as  compared  to  BP? 

Result  1.  The  main  memory  implementation  of  LinBP  is  up  to  600  times  faster  than  BP. 

Figure 5.3(a) shows our timing experiments, and Table 5.5 shows the times for the 5 largest graphs. LinBP shows approximately linear scaling in the number of edges. For reference, Figure 5.3(a) shows a dashed gray line that represents an exact linear scalability of 100 000 edges per second. The main-memory implementation of LinBP is faster than that of BP for two main reasons: (i) The LinBP update equations calculate beliefs as a function of beliefs. In contrast, the BP update equations calculate, for each node, outgoing messages as a function of incoming messages; (ii) Our matrix formulation of LinBP enables us to use well-optimized Java libraries for matrix operations. These optimized operations lead to a highly efficient algorithm.
[Figure 5.3 panels: (a) Scalability; (b) Recall and precision on #5; (c) Recall and precision on #5.]

Figure 5.3: (a): Scalability of methods in Java: dashed gray lines represent linear scalability. (b)-(c): Quality of LinBP with respect to BP: the vertical gray lines mark $\epsilon_H = 0.0002$ (from the sufficient convergence criterion).

Table 5.5: Timing results of all methods on the 5 largest synthetic graphs.

#    BP        LinBP     Comparison BP / LinBP
5    2         0.03      60
6    11        0.09      120
7    62        0.32      198
8    430       0.99      433
9    2 514     3.92      642

Question 2. Quality: How do the top belief assignments of LinBP and LinBP* compare to that of BP?

Result 2. BP, LinBP, and LinBP* give almost identical top belief assignments (for $\epsilon_H$ given by Lemma 5).

Synthetic Data. Figure 5.3(b) shows recall ($r$) and precision ($p$) of LinBP with BP as GT ("LinBP with regard to BP") on graph #5. Similar results hold for all other graphs. The vertical gray lines show $\epsilon_H = 0.0002$ and $\epsilon_H = 0.0028$, which result from our sufficient (Lemma 5) and exact (Lemma 4) convergence criteria of LinBP, respectively. The plots stop earlier than $\epsilon_H = 0.0028$ as BP stops converging earlier. We see that LinBP matches the top belief assignment of BP exactly in the upper range of the guaranteed convergence; for smaller $\epsilon_H$, discrepancies result from roundoff errors due to the limited precision of floating-point computations. We thus recommend choosing $\epsilon_H$ according to Lemma 4. Overall accuracy (harmonic mean of precision and recall) is still > 99.9% across all $\epsilon_H$.

Figure 5.3(c) shows that the results of LinBP and LinBP* are almost identical as long as $\epsilon_H$ is small enough for the algorithms to converge (both LinBP and LinBP* always return unique top belief assignments; thus, $r$ and $p$ are identical and we only need to show one graph for both). The vertical drops in $r$ and $p$ correspond to choices of $\epsilon_H$ for which LinBP stops converging.

Figure 5.4: Unscaled residual coupling matrix $H_{h|o}$.

Figure 5.5: F1 on DBLP data.

DBLP Data. Figure 5.4 gives the unscaled residual coupling matrix $H_{h|o}$ for the DBLP dataset. As shown in Figure 5.5, LinBP performs very well with absolute accuracy above 95%. LinBP and LinBP* approximate BP very well as long as BP converges. LinBP converges for $\epsilon_H < 0.0013$; however, BP stops converging earlier. This explains the gap between the convergence bounds for LinBP and the point at which the accuracy actually drops. For very small $\epsilon_H$, floating-point rounding errors occur.

In  summary,  LinBP  matches  the  classification  of  BP  very  well.  Misclassifications  are  mostly 
due  to  closely  tied  top  beliefs,  in  which  case  returning  both  tied  beliefs  would  arguably  be  the 
preferable  alternative. 


5.7  Related  Work 

The two main philosophies for transductive inference are logical approaches and statistical approaches. Logical approaches determine the solution based on hard rules, and are common in the database literature [ GK+02, GS10, HIST03, KK09]. Statistical approaches determine the solution based on soft rules. The related work comprises guilt-by-association approaches, which use limited prior knowledge and network effects in order to derive new knowledge. The main alternatives are semi-supervised learning (SSL), random walks with restarts (RWR), and label or belief propagation (BP). SSL methods can be divided into low-density separation methods, graph-based methods, methods for changing the representation, and co-training methods (see [ P07, ZhuOf ] for overviews). A multi-class approach has been introduced in [ ]. RWR methods are used to compute mainly node relevance; e.g., original and personalized PageRank [ P98, Hav03], lazy random walks [MC07], and fast approximations [ TD04, 'FP06]. For more pointers to guilt-by-association methods, refer to Section 4.1.
Belief Propagation: Convergence & Speed-up. We give basic background information on
BP in Section 4.1.3. As a reminder, BP (or min-sum, or sum-product algorithm) is an iterative
message-passing algorithm that is a very expressive formalism for assigning classes to unlabeled
nodes and has been used successfully in multiple settings for solving inference problems, such as
malware detection [ ], graph similarity [ , ] (also in Chapter 7), structure
identification [ , ] (Chapter 3), and pattern mining and anomaly detection [ ].

BP solves the inference problem approximately; it is known that when the factor graph has
a tree structure, it converges to the true marginals (stationary point) after a finite number of
iterations. Although in loopy graphs (which is the case for most real-world graphs) convergence to
the true marginals is not guaranteed, it is possible in locally tree-like structures. As a consequence,
approaches in the database community that rely on BP-based inference often lack convergence
guarantees [SAS1 ]. Convergence of BP in loopy graphs has been studied before [EMK06, IIW05,
]. To the best of our knowledge, all existing bounds for BP give only sufficient convergence
criteria. In contrast, our work presents a stronger result by providing sufficient and necessary
conditions for the convergence of LinBP, which is itself an approximation of BP. Other recent
work [ ] studies a form of linearization for unsupervised classification in the stochastic
block model without an obvious way to include supervision in this setting.

There  exist  various  works  that  speed  up  BP  by:  (i)  exploiting  the  graph  structure  [CGI  , 
PCWFO1  ],  (ii)  changing  the  order  of  message  propagation  [EMK06,  GLG09,  MKO  ],  or  (iii)  using 
the  MapReduce  framework  [  CCF1  ].  Here,  we  derive  a  linearized  formulation  of  standard  BP.  This 
is  a  multivariate  generalization  of  the  linearized  belief  propagation  algorithm  FaBP  (Chapter  4, 
[KKK 1  ])  from  binary  to  multiple  labels  for  classification. 


5.8  Summary 

This work showed that the widely used multi-class belief propagation algorithm can be
approximated by a linear system that replaces multiplication with addition. Specifically:

•  Problem  Formulation:  We  contribute  a  new,  fast  and  compact  matrix  formulation  for 
multi-class  BP  called  Linearized  Belief  Propagation  (LinBP)  (Section  5.2). 

•  Theory:  Based  on  the  linear  system,  we  give  a  closed-form  solution  with  the  help  of  the 
inverse  of  an  appropriate  matrix  (Section  5.3.2),  and  also  explain  exactly  when  the  system 
will  converge. 

•  Experiments on Real Graphs: We show that a main-memory implementation of LinBP is
faster than BP, while achieving comparable accuracy.
Part II

Multiple-Graph  Exploration 
Exploring Multiple Graphs: Overview


In  many  applications,  it  is  necessary  or  at  least  beneficial  to  explore  multiple  graphs  at  the  same 
time.  These  graphs  can  be  temporal  instances  of  the  same  set  of  objects  (time-evolving  graphs),  or 
disparate  networks  coming  from  different  sources.  At  a  macroscopic  level,  how  can  we  extract 
easy-to-understand  building  blocks  from  a  series  of  massive  graphs  and  summarize  the  dynamics  of 
their  underlying  phenomena  (e.g.,  communication  patterns  in  a  large  phonecall  network)?  How 
can  we  find  anomalies  in  a  time-evolving  corporate-email  correspondence  network  and  predict 
the  fall  of  a  company?  Are  there  differences  in  the  brain  wiring  of  more  creative  and  less  creative 
people?  How  do  different  types  of  communication  (e.g.,  messages  vs.  wall  posts  in  Facebook)  and 
their  corresponding  behavioral  patterns  compare?  In  these  and  more  applications,  the  underlying 
problem  is  often  graph  similarity.  In  this  part  of  the  dissertation  we  examine  three  main  ways  of 
exploring  and  understanding  multiple,  large-scale  graphs: 

•  Scalable  summarization  of  the  graphs  in  terms  of  important  temporal  graph  structures, 
which  enable  sense-making  and  visualization  (Chapter  6); 

•  Graph  similarity,  which  provides  a  fast  and  principled  way  of  comparing  two  graphs  that 
are  aligned  (e.g.,  brain  connectivity  networks)  and  explaining  their  differences  (Chapter  7); 

•  Graph  alignment,  which  is  a  generalization  of  the  previous  problem  and  finds  the  node 
correspondence and similarity between two unaligned networks (Chapter 8).

For each thrust, we give observations and models for real-world graphs, followed by efficient
algorithms  to  explore  the  different  aspects  (important  temporal  structures,  similarities)  of  multiple 
graphs  to  gain  a  better  understanding  of  the  processes  they  capture. 
Chapter based on work that appeared at KDD 2015 [SKZ+15].


Chapter  6 

Summarization  of  Dynamic  Graphs 


Given  a  large  phonecall  network  over  time,  how  can  we  describe  it  to  a  practitioner  with 
just  a  few  phrases?  Other  than  the  traditional  assumptions  about  real-world  graphs  involving 
degree  skewness,  what  can  we  say  about  their  connectivity?  For  example,  is  the  dynamic  graph 
characterized  by  many  large  cliques  which  appear  at  fixed  intervals  of  time,  or  perhaps  by  several 
large  stars  with  dominant  hubs  that  persist  throughout?  In  this  chapter  we  focus  on  these 
questions,  and  specifically  on  constructing  concise  summaries  of  large,  real-world  dynamic  graphs 
in  order  to  better  understand  their  underlying  behavior.  Here  we  extend  our  work  on  single-graph 
summarization  which  we  described  in  Chapter  3. 

The  problem  of  dynamic  graph  summarization  has  numerous  practical  applications.  Dynamic 
graphs  are  ubiquitously  used  to  model  the  relationships  between  various  entities  over  time,  which 
is  a  valuable  feature  in  almost  all  applications  in  which  nodes  represent  users  or  people.  Examples 
include  online  social  networks,  phonecall  networks,  collaboration  and  coauthorship  networks,  and 
other  interaction  networks. 

Though numerous graph algorithms, such as modularity-based community detection, spectral
clustering, and cut-based partitioning, are suitable for static contexts, they do not offer direct
dynamic counterparts. Furthermore, the traditional goals of clustering and community detection
tasks are not quite aligned with the endeavor we propose. These algorithms typically produce
groupings of nodes which satisfy or approximate some optimization function. However, they do
not offer a characterization of the outputs: are the detected groupings stars or chains, or perhaps
dense blocks? Furthermore, the lack of explicit ordering in the groupings leaves a practitioner
with limited time and no insights on where to begin understanding their data.

Table 6.1: Feature-based comparison of TimeCrunch with alternative approaches.

Here  we  propose  TimeCrunch,  an  effective  approach  to  concisely  summarizing  large,  dynamic 
graphs  which  extend  beyond  traditional  dense  and  isolated  “caveman”  communities.  Similarly 
to  the  single-graph  summarization  work  described  in  Chapter  3,  we  leverage  MDL  (Minimum 
Description  Length)  in  order  to  find  succinct  graph  patterns.  In  contrast  to  the  static  vocabulary 
we  introduced  before,  in  this  chapter  we  seek  to  identify  and  appropriately  describe  graphs  over 
time  using  a  lexicon  of  temporal  phrases  which  describe  temporal  connectivity  behavior.  Figure  6.1 
shows  results  found  from  applying  TimeCrunch  to  real-world  dynamic  graphs. 

•  Figure  6.1(a)  shows  a  constant  near-clique  of  40  users  in  the  Yahoo!  messaging  network 
interacting  over  4  weeks  in  April  2008.  The  relevant  subgraph  has  an  unnaturally  high  55% 
density  over  this  duration.  One  possible  explanation  is  that  these  users  may  be  bots  that 
message  each  other  in  an  effort  to  appear  normal  and  avoid  suspension.  We  cannot  verify, 
as  the  dataset  is  anonymized  for  privacy  purposes. 

•  Figure  6.1(b)  depicts  a  periodic  star  of  111  callers  in  the  phonecall  network  of  a  large, 
anonymous  Asian  city  during  the  last  week  of  December  2007.  We  observe  that  the  star 
behavior  oscillates  over  time,  and,  specifically,  odd-numbered  timesteps  have  stronger  star 
structure  than  the  even-numbered  ones.  Furthermore,  the  appearance  of  the  star  is  strongest 
on  Dec.  25th  and  31st,  corresponding  to  major  holidays. 

• Lastly, Figure 6.1(c) shows a ranged near-clique of 43 authors found in the DBLP network
who jointly published in biotechnology journals, such as Nature and Genome Research, from
2005 to 2012. This observation agrees with intuition, as works in this field typically have many
co-authors. The first and last timesteps shown have very sparse connectivity and were not
part of the detected structure; they serve only to demarcate the range of activity.

(a) 40 users of Yahoo! Messenger forming a constant near-clique with unusually high 55%
density, over 4 weeks in April 2008.

(b) 111 callers in a large phonecall network, forming a periodic star, over the last week of
December 2007 (heavy activity on holidays).

(c) 43 collaborating biotech authors forming a ranged near-clique in the DBLP network, jointly
publishing through 2005-2012.

Figure 6.1: TimeCrunch finds coherent, interpretable temporal structures. We show the reordered
subgraph adjacency matrices, over the timesteps of interest, each outlined in gray; edges are plotted
in alternating red and blue, for discernibility.

In  this  work,  we  seek  to  answer  the  following  informally  posed  problem: 

Problem Definition 5. [Dynamic Graph Summarization - Informal]

• Given: a dynamic graph, i.e., a time sequence of adjacency matrices¹
$\mathbf{A}_1, \mathbf{A}_2, \ldots, \mathbf{A}_t$,

• Find: a set of possibly overlapping temporal subgraphs to concisely describe the given
dynamic graph in a scalable fashion.

Our  main  contributions  are  as  follows: 

1. Problem Formulation: We show how to define the problem of dynamic graph understanding
in a compression context.

2. Effective and Scalable Algorithm: We develop TimeCrunch, a fast algorithm for dynamic
graph summarization. Our code for TimeCrunch is open-sourced at
http://danaikoutra.com/CODE/timecrunch.tar.gz.

3.  Experiments  on  Real  Graphs:  We  evaluate  TimeCrunch  on  multiple  real,  dynamic  graphs 
and  show  quantitative  and  qualitative  results. 

In this chapter we first give the problem formulation in Section 6.1 and then describe our
proposed method in detail in Section 6.2. Next, we empirically evaluate TimeCrunch in Section 6.3
using qualitative and quantitative experiments on a variety of real dynamic graphs. We refer to
the related work and summarize our contributions in Section 6.4 and Section 6.5, respectively.


6.1  Problem  Formulation 

In  this  section,  we  give  the  first  main  contribution  of  our  work:  formulation  of  dynamic  graph 
summarization  as  a  compression  problem,  using  MDL.  For  clarity,  in  Table  6.2  we  provide  the 
recurrent  symbols  used  in  this  chapter — for  the  reader’s  convenience,  we  repeat  the  definitions  of 
symbols  that  we  introduced  in  Chapter  3  for  static  graph  summarization. 

As a reminder to the reader, the Minimum Description Length (MDL) principle aims to be a
practical version of Kolmogorov Complexity [LV93], often associated with the motto Induction by
Compression. MDL states that given a model family $\mathcal{M}$, the best model
$M \in \mathcal{M}$ for some observed data $\mathcal{D}$ is that which minimizes
$L(M) + L(\mathcal{D} \mid M)$, where $L(M)$ is the length in bits used to describe $M$ and
$L(\mathcal{D} \mid M)$ is the length in bits used to describe $\mathcal{D}$ encoded using $M$.
MDL enforces lossless compression for fairness in the model selection process.

For our application, we focus on the analysis of undirected dynamic graphs in tensor fashion
using fixed-length, discretized time intervals. However, our notation will reflect the treatment
of the problem as one with a series of individual snapshots of graphs, rather than a tensor, for
readability purposes. Specifically, we consider a dynamic graph $G(\mathcal{V}, \mathcal{E})$ with
$n = |\mathcal{V}|$ nodes, $m = |\mathcal{E}|$ edges and $t$ timesteps, without self-loops. Here,
$G = \bigcup_x G_x(\mathcal{V}, \mathcal{E}_x)$, where $G_x$ and $\mathcal{E}_x$ correspond to the
graph and edge-set for the $x$-th timestep. The ideas proposed in this work, however, can easily
be generalized to other types of dynamic graphs.

¹If the graphs have different, but overlapping, node sets
$\mathcal{V}_1, \mathcal{V}_2, \ldots, \mathcal{V}_t$, we assume that
$\mathcal{V} = \mathcal{V}_1 \cup \mathcal{V}_2 \cup \ldots \cup \mathcal{V}_t$, and the
disconnected nodes are treated as singletons.

Table 6.2: TimeCrunch: Frequently used symbols and their definitions.

Symbol                          Definition
------------------------------  ----------------------------------------------------------
$G$, $\mathbf{A}$               dynamic graph and adjacency tensor, respectively
$\mathcal{V}$, $n$              node-set and number of nodes of $G$, respectively
$\mathcal{E}$, $m$              edge-set and number of edges of $G$, respectively
$G_x$, $\mathbf{A}_x$           $x$-th timestep, adjacency matrix of $G$, respectively
$\mathcal{E}_x$, $m_x$          edge-set and number of edges of $G_x$, respectively
fc, nc                          full clique and near clique, respectively
fb, nb                          full bipartite core and near bipartite core, respectively
st                              star graph
ch                              chain graph
o                               oneshot
c                               constant
r                               ranged
p                               periodic
f                               flickering
$t$                             total number of timesteps for the dynamic graph
$\Delta$                        set of temporal signatures
$\Omega$                        set of static identifiers
$\Phi$                          lexicon, set of temporal phrases $\Phi = \Delta \times \Omega$
$\times$                        Cartesian set product
$M$, $s$                        model $M$, temporal structure $s \in M$, respectively
$|S|$                           cardinality of set $S$
$|s|$                           # of nodes in structure $s$
$u(s)$                          timesteps in which structure $s$ appears
$v(s)$                          temporal phrase of structure $s$, $v(s) \in \Phi$
$\mathbf{M}$                    approximation of $\mathbf{A}$ induced by $M$
$\mathbf{E}$                    error matrix $\mathbf{E} = \mathbf{M} \oplus \mathbf{A}$
$\oplus$                        exclusive OR
$L(G, M)$                       # of bits used to encode $M$ and $G$ given $M$
$L(M)$                          # of bits to encode $M$

For our summary, we consider the set of temporal phrases $\Phi = \Delta \times \Omega$, where
$\Delta$ corresponds to the set of temporal signatures, $\Omega$ corresponds to the set of static
structure identifiers and $\times$ denotes the Cartesian set product. Though we can include
arbitrary temporal signatures and static structure identifiers into these sets depending on the
types of temporal subgraphs we expect to find in a given dynamic graph, we choose five temporal
signatures which we anticipate finding in real-world dynamic graphs [ ]: oneshot (o), ranged (r),
periodic (p), flickering (f) and constant (c), and six very common structures found in real-world
static graphs (Chapter 3): stars (st), full and near cliques (fc, nc), full and near bipartite cores
(fb, nb) and chains (ch). In summary, we have the signatures $\Delta = \{o, r, p, f, c\}$, static
identifiers $\Omega = \{st, fc, nc, fb, nb, ch\}$ and temporal phrases
$\Phi = \Delta \times \Omega$. We will further describe these signatures, identifiers and phrases
after formalizing our objective.
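As a small illustration of the lexicon defined above, the following sketch enumerates the Cartesian
product of the five signatures and six identifiers. The variable and function names are ours and
purely illustrative; only the symbols o, r, p, f, c and st, fc, nc, fb, nb, ch come from the text.

```python
from itertools import product

# Temporal signatures (Delta) and static identifiers (Omega) as listed in the text.
SIGNATURES = ["o", "r", "p", "f", "c"]
IDENTIFIERS = ["st", "fc", "nc", "fb", "nb", "ch"]


def build_lexicon(signatures, identifiers):
    """Return the Cartesian product as a list of (signature, identifier) pairs."""
    return list(product(signatures, identifiers))


LEXICON = build_lexicon(SIGNATURES, IDENTIFIERS)

# Concatenated names such as "fst" (flickering star) or "pfc" (periodic full clique).
PHRASE_NAMES = [signature + identifier for signature, identifier in LEXICON]
```

With five signatures and six identifiers, the lexicon contains 5 x 6 = 30 temporal phrases.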

In order to use MDL for dynamic graph summarization using these temporal phrases, we next
define the model family $\mathcal{M}$, the means by which a model $M \in \mathcal{M}$ describes
our dynamic graph and how to quantify the cost of encoding in terms of bits.


6.1.1  Using  MDL  for  Dynamic  Graph  Summarization 

We consider models $M \in \mathcal{M}$ to be composed of ordered lists of temporal graph
structures with node overlaps, but no edge overlaps. Each $s \in M$ describes a certain region of
the adjacency tensor $\mathbf{A}$ in terms of the interconnectivity of its nodes. We will use
$area(s, M, \mathbf{A})$ to describe the edges $(i, j, x) \in \mathbf{A}$ which $s$ induces,
writing only $area(s)$ when the context for $M$ and $\mathbf{A}$ is clear.

Our model family $\mathcal{M}$ consists of all possible permutations of subsets of $\mathcal{C}$,
where $\mathcal{C} = \bigcup_v \mathcal{C}_v$ and $\mathcal{C}_v$ denotes the set of all possible
temporal structures of phrase $v \in \Phi$ over all possible combinations of timesteps. That is,
$\mathcal{M}$ consists of all possible models $M$, which are ordered lists of temporal phrases
$v \in \Phi$, such as flickering stars (fst), periodic full cliques (pfc), etc., over all possible
subsets of $\mathcal{V}$ and $G_1 \cdots G_t$. Through MDL, we seek the model
$M \in \mathcal{M}$ which minimizes the encoding length of the model $M$ and the adjacency tensor
$\mathbf{A}$ given $M$.

Our fundamental approach for transmitting the adjacency tensor $\mathbf{A}$ via the model $M$ is
described next. First, we transmit $M$. Next, given $M$, we induce the approximation of the
adjacency tensor $\mathbf{M}$ as described by each temporal structure $s \in M$. For each
structure $s$, we induce the edges in $area(s)$ in $\mathbf{M}$ accordingly. Given that
$\mathbf{M}$ is a summary approximation to $\mathbf{A}$, $\mathbf{M} \neq \mathbf{A}$ most likely.
Since MDL requires lossless encoding, we must also transmit the error
$\mathbf{E} = \mathbf{M} \oplus \mathbf{A}$, obtained by taking the exclusive OR between
$\mathbf{M}$ and $\mathbf{A}$. Given $M$ and $\mathbf{E}$, a recipient can construct the full
adjacency tensor $\mathbf{A}$ in a lossless fashion.
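The lossless round trip described above can be sketched in a few lines. This is a minimal
illustration, assuming that each binary tensor is stored as a set of (i, j, x) triples with i < j
(node pair) and x the timestep; under that assumption the exclusive OR is a set symmetric
difference. The function names and the toy edges are hypothetical.

```python
def error_edges(approx_edges, true_edges):
    """E = M xor A: triples present in exactly one of the two tensors."""
    return approx_edges ^ true_edges


def reconstruct(approx_edges, error):
    """A = M xor E: applying the same xor again undoes the first one."""
    return approx_edges ^ error


# Toy dynamic graph with 3 nodes and 2 timesteps (made-up data).
A = {(0, 1, 0), (0, 2, 0), (0, 1, 1), (1, 2, 1)}

# Suppose the model induces the same star (hub = node 0) at both timesteps.
M_approx = {(0, 1, 0), (0, 2, 0), (0, 1, 1), (0, 2, 1)}

E = error_edges(M_approx, A)
A_recovered = reconstruct(M_approx, E)
```

Here the error contains one extraneous edge induced by the model and one true edge the model
missed, and the recipient recovers the original tensor exactly.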

Thus,  we  formalize  the  problem  we  tackle  as  follows: 

Problem Definition 6. [Minimum Dynamic Graph Description] Given a dynamic graph $G$ with
adjacency tensor $\mathbf{A}$ and temporal phrase lexicon $\Phi$, find the smallest model $M$
which minimizes the total encoding length

$$L(G, M) = L(M) + L(\mathbf{E})$$

where $\mathbf{E} = \mathbf{M} \oplus \mathbf{A}$ is the error matrix and $\mathbf{M}$ is the
approximation of $\mathbf{A}$ induced by $M$.

In  the  following  subsections,  we  further  formalize  the  task  of  encoding  the  model  M  and  the 
error  matrix  E. 
6.1.2 Encoding the Model

The encoding length to fully describe a model $M \in \mathcal{M}$ is:

$$L(M) = \underbrace{L_{\mathbb{N}}(|M| + 1) + \log \binom{|M| + |\Phi| - 1}{|\Phi| - 1}}_{\text{\# of structures in total, and per type}} + \sum_{s \in M} \underbrace{\Big( -\log P(v(s) \mid M) + L(c(s)) + L(u(s)) \Big)}_{\text{per structure: type, connectivity and temporal details}} \tag{6.1}$$

We begin by transmitting the total number of temporal structures in $M$ using $L_{\mathbb{N}}$,
Rissanen's optimal encoding for integers greater than or equal to 1 [ ]. Next, we optimally encode
the number of temporal structures for each phrase $v \in \Phi$ in $M$. Then, for each structure
$s \in M$, we encode the type $v(s)$ using optimal prefix codes [CT06], the connectivity $c(s)$
and the temporal presence of $s$, consisting of the ordered list of timesteps $u(s)$ in which $s$
appears.
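To make the terms of the model cost concrete, here is a minimal sketch of how it could be
evaluated. It rests on several assumptions that are ours, not the text's: logarithms are base 2,
$L_{\mathbb{N}}$ is computed with the usual log-star sum plus the constant $\log_2 2.865064$,
$P(v(s) \mid M)$ is taken to be the empirical frequency of the phrase within the model, and the
per-structure costs $L(c(s))$ and $L(u(s))$ are supplied by the caller. All function names are
hypothetical.

```python
from collections import Counter
from math import comb, log2


def universal_integer_bits(n):
    """Rissanen's universal code length L_N(n) in bits, for an integer n >= 1."""
    if n < 1:
        raise ValueError("L_N is defined for integers >= 1")
    bits = log2(2.865064)  # normalizing constant
    term = log2(n)
    while term > 0:  # add log n + log log n + ... while the terms stay positive
        bits += term
        term = log2(term)
    return bits


def model_bits(structures, lexicon_size):
    """Evaluate L(M) for a list of (phrase, connectivity_bits, presence_bits) tuples."""
    num_structures = len(structures)
    bits = universal_integer_bits(num_structures + 1)
    # Number of structures per phrase: a weak composition over the lexicon.
    bits += log2(comb(num_structures + lexicon_size - 1, lexicon_size - 1))
    counts = Counter(phrase for phrase, _, _ in structures)
    for phrase, connectivity_bits, presence_bits in structures:
        # Optimal prefix code length for the phrase (assumed empirical frequency).
        bits += -log2(counts[phrase] / num_structures)
        bits += connectivity_bits + presence_bits
    return bits
```

For example, `model_bits([("cst", 12.0, 0.0), ("ofc", 20.5, 3.0)], lexicon_size=30)` would return
the cost of a two-structure model whose connectivity and presence costs were computed elsewhere;
the numbers in this call are made up.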

In order to have a coherent model encoding scheme, we next define the encoding for each
phrase $v \in \Phi$ such that we can compute $L(c(s))$ and $L(u(s))$ for all structures in $M$.
The connectivity $c(s)$ corresponds to the edges in $area(s)$ which are induced by $s$, whereas
the temporal presence $u(s)$ corresponds to the timesteps in which $s$ is present. We consider the
connectivity and temporal presence separately, as the encoding for a temporal structure $s$
described by a phrase $v$ is the sum of encoding costs for the connectivity of the corresponding
static structure identifier in $\Omega$ and its temporal presence as indicated by a temporal
signature in $\Delta$.

Encoding  Connectivity 

To compute the encoding cost $L(c(s))$ for the connectivity for each type of static structure
identifier in our identifier set $\Omega$ (i.e., cliques, near-cliques, bipartite cores,
near-bipartite cores, stars and chains) we use the formulas introduced in Section 3.2.2.

Encoding  Temporal  Presence 

For a given phrase $v \in \Phi$, it is not sufficient to only encode the connectivity of the
underlying static structure. For each structure $s$, we must also encode the temporal presence
$u(s)$, consisting of a set of ordered timesteps in which $s$ appears. In this section, we
describe how to compute the encoding cost $L(u(s))$ for each of the temporal signatures in the
signature set $\Delta$.

We note that describing a set of timesteps $u(s)$ in terms of temporal signatures in $\Delta$ is
yet another model selection problem for which we can leverage MDL. As with connectivity encoding,
labeling $u(s)$ with a given temporal signature may not be precisely accurate; however, any
mistakes will add to the cost of transmitting the error. Errors in temporal presence encoding will
be further detailed in Section 6.1.3.

Oneshot: Oneshot structures appear at only one timestep in $G_1 \cdots G_t$; that is,
$|u(s)| = 1$. These structures represent graph anomalies, in the sense that they are non-recurrent
interactions which are only observed once. The encoding cost $L(o)$ for the temporal presence of
a oneshot structure $o$ can be written as:

$$L(o) = \log(t)$$

As the structure occurs only once, we only have to identify the timestep of occurrence from the
$t$ observed timesteps.

Ranged: Ranged structures are characterized by a short-lived existence. These structures appear
for several timesteps in a row before disappearing again; they are defined by a single burst of
activity. The encoding cost $L(r)$ for a ranged structure $r$ is given by:

$$L(r) = \underbrace{L_{\mathbb{N}}(|u(s)|)}_{\text{\# of timesteps}} + \underbrace{\log \binom{t}{2}}_{\text{start and end timestep IDs}}$$

We first encode the number of timesteps in which the structure occurs, followed by the timestep
IDs of both the start and end timestep marking the span of activity.

Periodic: Periodic structures are an extension of ranged structures in that they appear at fixed
intervals. However, these intervals are spaced greater than one timestep apart. As such, the same
encoding cost function we use for ranged structures suffices here. That is, $L(p)$ for a periodic
structure $p$ is given by $L(p) = L(r)$.

For both ranged and periodic structures, periodicity can be inferred from the start and end
markers along with the number of timesteps $|u(s)|$, allowing the reconstruction of the original
$u(s)$.

Flickering: A structure is flickering if it appears only in some of the $G_1 \cdots G_t$
timesteps, and does so without any discernible ranged/periodic pattern. The encoding cost $L(f)$
for a flickering structure $f$ is as follows:

$$L(f) = \underbrace{L_{\mathbb{N}}(|u(s)|)}_{\text{\# of timesteps}} + \underbrace{\log \binom{t}{|u(s)|}}_{\text{IDs for the timesteps of occurrence}}$$

We encode the number of timesteps in which the structure occurs in addition to the IDs for the
timesteps of occurrence.

Constant: Constant structures persist throughout all timesteps. That is, they occur at each
timestep $G_1 \cdots G_t$ without exception. In this case, our encoding cost $L(c)$ for a constant
structure $c$ is defined as $L(c) = 0$. Intuitively, information regarding the timesteps in which
the structure appears is "free", as it is already given by encoding the phrase descriptor $v(s)$.
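The five temporal presence costs above can be collected into one function. The following is a
sketch under stated assumptions: logarithms are base 2, $L_{\mathbb{N}}$ is computed with the
usual log-star sum plus the constant $\log_2 2.865064$, and the function names are ours rather
than part of TimeCrunch.

```python
from math import comb, log2


def universal_integer_bits(n):
    """Rissanen's universal code length L_N(n) in bits, for an integer n >= 1."""
    if n < 1:
        raise ValueError("L_N is defined for integers >= 1")
    bits = log2(2.865064)  # normalizing constant
    term = log2(n)
    while term > 0:
        bits += term
        term = log2(term)
    return bits


def presence_bits(signature, num_active, t):
    """Bits to encode the temporal presence u(s) under one temporal signature.

    signature  -- one of "o", "r", "p", "f", "c"
    num_active -- |u(s)|, the number of timesteps in which the structure appears
    t          -- total number of timesteps in the dynamic graph
    """
    if signature == "o":  # oneshot: identify the single timestep
        return log2(t)
    if signature in ("r", "p"):  # ranged and periodic share the same cost
        if t < 2:
            raise ValueError("start and end markers need at least two timesteps")
        return universal_integer_bits(num_active) + log2(comb(t, 2))
    if signature == "f":  # flickering: which num_active of the t timesteps
        return universal_integer_bits(num_active) + log2(comb(t, num_active))
    if signature == "c":  # constant: implied by the phrase itself
        return 0.0
    raise ValueError(f"unknown temporal signature: {signature!r}")
```

Comparing, say, `presence_bits("r", 4, 8)` with `presence_bits("f", 4, 8)` shows why a ranged
description is cheaper than a flickering one when both fit: the former pays for two markers, the
latter for an arbitrary subset of timesteps.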
6.1.3 Encoding the Errors

Given that $M$ is a summary and the $\mathbf{M}$ induced by $M$ is only an approximation of
$\mathbf{A}$, it is necessary to encode errors made by $M$. In particular, there are two types of
errors we must consider. The first is error in connectivity: if $area(s)$ induced by structure $s$
is not exactly the same as the associated patch in $\mathbf{A}$, we encode the relevant mistakes.
The second is the error induced by encoding the set of timesteps $u(s)$ with a fixed temporal
signature, given that $u(s)$ may not precisely follow the temporal pattern used to encode it.

Encoding  Errors  in  Connectivity 

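Following the same principles described in Section 3.2.3, we encode the error tensor
$\mathbf{E} = \mathbf{M} \oplus \mathbf{A}$ as two different pieces: $\mathbf{E}^+$ and
$\mathbf{E}^-$. The first piece, $\mathbf{E}^+$, refers to the area of $\mathbf{A}$ which $M$
models and the area of $\mathbf{M}$ that includes extraneous edges not present in the original
graph. The second piece, $\mathbf{E}^-$, consists of the area of $\mathbf{A}$ which $M$ does not
model and therefore does not describe. As a reminder, the encoding for $\mathbf{E}^+$ and
$\mathbf{E}^-$ is:

$$L(\mathbf{E}^+) = \underbrace{\log(|\mathbf{E}^+|)}_{\text{\# of edges}} + \underbrace{\|\mathbf{E}^+\| \, l_1 + \|\mathbf{E}^+\|' \, l_0}_{\text{edges}}$$

$$L(\mathbf{E}^-) = \underbrace{\log(|\mathbf{E}^-|)}_{\text{\# of edges}} + \underbrace{\|\mathbf{E}^-\| \, l_1 + \|\mathbf{E}^-\|' \, l_0}_{\text{edges}}$$

where $\|\mathbf{E}\|$ and $\|\mathbf{E}\|'$ denote the counts for existing and non-existing edges
in $area(\mathbf{E})$, respectively. Then,
$l_1 = -\log(\|\mathbf{E}\| / (\|\mathbf{E}\| + \|\mathbf{E}\|'))$ and
$l_0 = -\log(\|\mathbf{E}\|' / (\|\mathbf{E}\| + \|\mathbf{E}\|'))$ represent the length of the
optimal prefix codes for the existing and non-existing edges, respectively. For more explanations
about our choices, refer to Section 3.2.3.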
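As a numerical illustration of the error cost, the sketch below evaluates the formula for one
error piece from its two counts. Two points are our assumptions rather than statements from the
text: logarithms are base 2, and $|\mathbf{E}|$ in the first term is taken to be the number of
cells in the area, $\|\mathbf{E}\| + \|\mathbf{E}\|'$. A count of zero contributes no bits, which
avoids evaluating the logarithm of zero. The function name is hypothetical.

```python
from math import log2


def error_piece_bits(num_existing, num_non_existing):
    """Bits for one error piece (E+ or E-), given its counts of 1s and 0s."""
    total = num_existing + num_non_existing
    if total == 0:
        return 0.0
    bits = log2(total)  # first term, assuming |E| is the number of cells in the area
    for count in (num_existing, num_non_existing):
        if count > 0:
            # count * l, with l = -log2(count / total) the optimal prefix code length
            bits += count * -log2(count / total)
    return bits
```

For instance, an area with 4 existing and 4 non-existing cells costs log2(8) + 4 * 1 + 4 * 1 = 11
bits under these assumptions, since each cell then needs a one-bit code.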

Encoding  Errors  in  Temporal  Presence 

For encoding errors induced by identifying $u(s)$ as one of the temporal signatures, we turn to
optimal prefix codes applied over the error distribution for each structure $s$. Given the
information encoded for each signature type in $\Delta$, we can reconstruct an approximation
$\tilde{u}(s)$ of the original timesteps $u(s)$ such that $|\tilde{u}(s)| = |u(s)|$. Using this
approximation, the encoding cost $L(e_u(s))$ for the error $e_u(s) = u(s) - \tilde{u}(s)$ is
defined as

$$L(e_u(s)) = \sum_{k \in h(e_u(s))} \Big( \underbrace{\log(k)}_{\text{error magnitude}} + \underbrace{\log c(k)}_{\text{\# of occurrences}} + \underbrace{c(k) \, l_k}_{\text{error}} \Big)$$

where $h(e_u(s))$ denotes the set of elements with unique magnitude in $e_u(s)$, $c(k)$ denotes
the count of element $k$ in $e_u(s)$ and $l_k$ denotes the length of the optimal prefix code for
$k$. For each magnitude error, we encode the magnitude of the error, the number of times it occurs,
and the actual errors using optimal prefix codes. Using the model in conjunction with the temporal
presence and connectivity errors, a recipient can first recover the $u(s)$ for each $s \in M$,
approximate $\mathbf{A}$ with $\mathbf{M}$ induced by $M$, produce $\mathbf{E}$ from
$\mathbf{E}^+$ and $\mathbf{E}^-$, and finally recover $\mathbf{A}$ losslessly through
$\mathbf{A} = \mathbf{M} \oplus \mathbf{E}$.

Remark: For a dynamic graph $G$ of $n$ nodes, the search space $\mathcal{M}$ for the best model
$M \in \mathcal{M}$ is intractable, as it consists of all the permutations of all possible temporal
structures over the lexicon $\Phi$, over all possible subsets of the node-set $\mathcal{V}$ and
over all possible graph timesteps $G_1 \cdots G_t$. Furthermore, $\mathcal{M}$ is not easily
exploitable for efficient search. Thus, we propose several practical approaches for the purpose of
finding good and interpretable temporal models/summaries for $G$.


6.2  Proposed  Method:  TimeCrunch 

Thus far, we have described our strategy of formulating dynamic graph summarization as a
problem in a compression context for which we can leverage MDL. Specifically, we have detailed
how to encode a model and the associated error, which can be used to losslessly reconstruct the
original dynamic graph $G$. Our models are characterized by ordered lists of temporal structures
which are further classified as phrases from the lexicon $\Phi$. Each $s \in M$ is identified by a
phrase $p \in \Phi$ over

• the node connectivity $c(s)$, i.e., an induced set of edges (depending on the static structure
identifier st, fc, etc.), and

• the associated temporal presence $u(s)$, i.e., an ordered list of timesteps (captured by a
temporal signature o, r, etc., and deviations) in which the temporal structure is active.

The error consists of the edges which are not covered by $\mathbf{M}$, the approximation of
$\mathbf{A}$ induced by $M$.

Next,  we  discuss  how  we  find  good  candidate  temporal  structures  to  populate  the  candidate 
set  C,  as  well  as  how  we  find  the  best  model  M  with  which  to  summarize  our  dynamic  graph.  The 
pseudocode  for  our  algorithm  is  given  in  Algorithm  6.3  and  the  next  subsections  detail  each  step 
of  our  approach. 

6.2.1  Generating  Candidate  Static  Structures 

TimeCrunch takes an incremental approach to dynamic graph summarization. That is, our
approach begins with considering potentially useful subgraphs for summarization over the static
graphs $G_1 \cdots G_t$. There are numerous static graph decomposition algorithms which enable
community detection, clustering and partitioning for these purposes. Several of these approaches
are mentioned in Section 6.4; these include EigenSpokes [ 'SS 1 10], METIS [KK95], spectral
partitioning [ Y99], Graclus [ ], cross-associations [ ], Subdue [ ] and
SlashBurn [ ]. Summarily, for each $G_1 \ldots G_t$, a set of subgraphs $\mathcal{F}$ is produced.
Algorithm 6.3 TimeCrunch

1: Generating Candidate Static Subgraphs: Generate static subgraphs for each $G_1 \cdots G_t$
using traditional static graph decomposition approaches.

2: Labeling Candidate Static Subgraphs: Label each static subgraph as a static structure
corresponding to the identifier $x \in \Omega$ which minimizes the local encoding cost.

3: Stitching Candidate Temporal Structures: Stitch the static structures from $G_1 \ldots G_t$
together to form temporal structures with coherent connectivity behavior, and label them according
to the phrase $p \in \Phi$ which minimizes the temporal presence encoding cost.

4: Composing the Summary: Compose a model $M$ of important, non-redundant temporal
structures which summarize $G$ using the Vanilla, Top-10, Top-100 and Stepwise heuristics.
Choose the $M$ associated with the heuristic that produces the smallest total encoding cost.


6.2.2  Labeling  Candidate  Static  Structures 

Once we have the set of static subgraphs ℱ from $G_1, \ldots, G_t$, we next seek to label each subgraph
in ℱ according to the static structure identifier in Ω that best fits the connectivity of the given
subgraph.

Definition 3. [Static structures] A static structure is a static subgraph that is labeled with a static
identifier in Ω = {fc, nc, fb, nb, st, ch}.

For each subgraph construed as a set of nodes L ⊆ V for a fixed timestep, does the adjacency
matrix of L best resemble a star, near or full clique, near or full bipartite core or a chain? To
answer this question, we leverage the encoding scheme discussed in Section 3.3.2. In a nutshell,
we try encoding the subgraph L using each of the static identifiers in Ω and label it with the
identifier x ∈ Ω which minimizes the encoding cost.


6.2.3  Stitching  Candidate  Temporal  Structures 

Thus far, we have a set of static subgraphs ℱ over $G_1, \ldots, G_t$ labeled with the associated static
identifiers which best represent the subgraph connectivity (from now on, we refer to ℱ as a set of
static structures instead of subgraphs, as they have been labeled with identifiers). From this set, our
goal is to find temporal structures - namely, we seek to find static subgraphs which have the same
patterns of connectivity over one or more timesteps and stitch them together. Thus, we formulate
the problem of finding coherent temporal structures in G as a clustering problem over ℱ. Though
there are several criteria that we could use for clustering static structures together, we employ the
following based on their intuitive meaning:
Definition 4. [Temporal structures] Two static structures belong to the same temporal structure
(i.e., they are in the same cluster) if they have

• substantial overlap in the node-sets composing their respective subgraphs, and

• exactly the same, or similar (full and near clique, or full and near bipartite core) static
structure identifiers.

These  criteria,  if  satisfied,  allow  us  to  find  groups  of  nodes  that  share  interesting  connectivity 
patterns  over  time.  For  example,  in  a  phonecall  network,  the  nodes  ‘Smith’,  ‘Johnson’,  and 
‘Tompson’  who  call  each  other  every  Sunday  form  a  periodic  clique  (temporal  structure). 
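
The two criteria of Definition 4 can be sketched as a pairwise test. This is only an illustration under stated assumptions, not the TimeCrunch implementation: the function names are hypothetical, "substantial overlap" is measured here with Jaccard similarity and a 0.5 cutoff (the measure and threshold that the stitching step uses later in this chapter), and the fourth caller 'Lee' is invented for the example.

```python
# Identifier families that Definition 4 treats as "the same or similar".
SIMILAR_FAMILIES = [{"fc", "nc"}, {"fb", "nb"}, {"st"}, {"ch"}]


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two node-sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def same_family(id_a, id_b):
    """True if both static identifiers fall in one family (e.g. fc and nc)."""
    return any(id_a in fam and id_b in fam for fam in SIMILAR_FAMILIES)


def belong_together(struct_a, struct_b, min_overlap=0.5):
    """Check both criteria; a structure is an (identifier, node_set) pair.

    min_overlap is an assumed cutoff for "substantial overlap".
    """
    (id_a, nodes_a), (id_b, nodes_b) = struct_a, struct_b
    return same_family(id_a, id_b) and jaccard(nodes_a, nodes_b) >= min_overlap


# Two Sundays of the phonecall example, plus a star over the same people.
sunday_1 = ("fc", {"Smith", "Johnson", "Tompson"})
sunday_2 = ("nc", {"Smith", "Johnson", "Tompson", "Lee"})
hub_star = ("st", {"Smith", "Johnson", "Tompson"})

print(belong_together(sunday_1, sunday_2))  # full and near clique, overlap 3/4
print(belong_together(sunday_1, hub_star))  # same nodes, different family
```

The first pair satisfies both criteria, while the second fails on the identifier criterion even though the node-sets are identical.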

Conducting the clustering by naively comparing each static structure in ℱ to the others
will produce the desired result, but is quadratic in the number of static structures and is thus
undesirable from a scalability point of view. Instead, we propose an incremental approach using
repeated rank-1 Singular Value Decomposition (SVD) for clustering the static structures, which
offers linear time complexity in the number of edges m in G.

Matrix  definitions  for  clustering  We  first  define  the  matrices  that  we  will  use  to  cluster  the 
static  structures. 

Definition 5. [SNMM] The structure-node membership matrix (SNMM), B, is a |ℱ| × |V| matrix,
where $B_{ij}$ indicates whether the i-th row (structure) in ℱ (B) contains node j in its node-set.
Thus, B is a matrix indicating the membership of the nodes in V to each of the static structures
in ℱ.

We note that any two equivalent rows in B are characterized by structures that share the same
node-set (but possibly different static identifiers). As our clustering criteria mandate that we only
cluster structures with the same or similar static identifiers, in our algorithm, we construct four
SNMMs - $B_{st}$, $B_{cl}$, $B_{bc}$ and $B_{ch}$ - corresponding to the associated matrices for stars, near and full
cliques, near and full bipartite cores and chains, respectively. Now, any two equivalent rows in
$B_{cl}$ are characterized by structures that share the same node-set, and the same or similar static
identifiers (full or near clique), and analogously for the other matrices. Next, we utilize SVD to
cluster the rows in each SNMM, effectively clustering the structures in ℱ.
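
To make the matrix definitions concrete, the following sketch builds one binary SNMM per identifier family from a list of labeled static structures. It assumes nodes are numbered 0 to |V|-1 and uses dense NumPy arrays for readability; the function name and the family keys ("st", "cl", "bc", "ch") are choices made for this example, and a scalable implementation would presumably use sparse matrices.

```python
import numpy as np

# Map each static identifier to the SNMM (family) it contributes a row to.
FAMILY_OF = {"st": "st", "fc": "cl", "nc": "cl",
             "fb": "bc", "nb": "bc", "ch": "ch"}


def build_snmms(structures, num_nodes):
    """Build one structure-node membership matrix per identifier family.

    structures: list of (identifier, node_set) pairs, nodes in 0..num_nodes-1.
    Returns a dict mapping family -> binary matrix (one row per structure).
    """
    rows = {family: [] for family in ("st", "cl", "bc", "ch")}
    for identifier, nodes in structures:
        row = np.zeros(num_nodes, dtype=np.int8)
        row[list(nodes)] = 1  # mark membership of each node in the structure
        rows[FAMILY_OF[identifier]].append(row)
    return {
        family: np.vstack(r) if r else np.zeros((0, num_nodes), dtype=np.int8)
        for family, r in rows.items()
    }


structures = [
    ("fc", {0, 1, 2}),     # full clique at one timestep
    ("nc", {0, 1, 2, 3}),  # near clique at another timestep
    ("st", {4, 5, 6}),     # star
    ("fc", {0, 1, 2}),     # the same clique again at a later timestep
]
snmms = build_snmms(structures, num_nodes=7)
print(snmms["cl"])
```

Rows 0 and 2 of the clique matrix are identical, which is exactly the "equivalent rows" situation described above: the same node-set with the same or similar identifier.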

Clustering  with  SVD  We  first  give  the  definition  of  SVD,  and  then  describe  how  we  can  use  it 
to  cluster  the  static  structures  and  discover  temporal  structures. 

Definition 6. [SVD] The rank-k SVD of an m × n matrix A factorizes it into 3 matrices: the m × k
matrix of left-singular vectors U, the k × k diagonal matrix of singular values Σ and the n × k
matrix of right-singular vectors V, such that $A = U \Sigma V^T$.
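
As a small illustration of Definition 6, the sketch below obtains a rank-k SVD by truncating the full decomposition returned by `numpy.linalg.svd`. The helper name `rank_k_svd` and the toy matrix are assumptions for this example. Note that the product of the three factors reproduces A exactly only when k is at least the rank of A; otherwise it is the best rank-k approximation.

```python
import numpy as np


def rank_k_svd(A, k):
    """Return (U, Sigma, V) with shapes m x k, k x k and n x k."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], np.diag(s[:k]), Vt[:k, :].T


# A toy membership-style matrix of rank 2: two identical rows plus one other.
A = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1]], dtype=float)

U, Sigma, V = rank_k_svd(A, k=2)
A_k = U @ Sigma @ V.T

print(U.shape, Sigma.shape, V.shape)
print(np.allclose(A_k, A))  # exact here because rank(A) = 2 = k
```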

A rank-k SVD effectively reduces the input data to its best k-dimensional representation, each
dimension of which can be mined separately for clustering and community detection purposes. However,
one major issue with using SVD in this fashion is that identifying the desired number of clusters k
upfront is a non-trivial task. To this end, [ ] shows that in cases where the input matrix
is sparse, repeatedly clustering using k rank-1 decompositions and adjusting the input matrix
accordingly approximates the batch rank-k decomposition. This is a valuable result in our case. As
we do not initially know the number of clusters needed to group the structures in ℱ, we eliminate
the need to define k altogether by repeatedly applying rank-1 SVD using power iteration and
removing the discovered clusters from each SNMM until all clusters have been found (when all
SNMMs are fully sparse and thus deflated). However, in practice, full deflation is unnecessary for
summarization purposes, as the most dominant clusters are found in early iterations due to the
nature of SVD. For each of the SNMMs, the matrix B used in the (i+1)-th iteration of this iterative
process is computed as

$$B^{i+1} = B^{i} - I^{S_i} \circ B^{i}$$

where $S_i$ denotes the set of row IDs corresponding to the structures which were clustered together
in iteration i, $I^{S_i}$ denotes the indicator matrix with 1s in the rows specified by $S_i$, and $\circ$ denotes the
Hadamard (element-wise) matrix product. This update to B is needed between iterations, as without subtracting
the previously found cluster, repeated rank-1 decompositions would find the same cluster ad
infinitum, and the algorithm would not converge.
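
The deflation update can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the top left-singular vector is found by plain power iteration on B Bᵀ, the matrices are dense NumPy arrays, and the cluster is picked with a crude magnitude cutoff of 0.5 that merely stands in for the Jaccard-based matching described next. All function names are hypothetical.

```python
import numpy as np


def top_left_singular_vector(B, iters=100, seed=0):
    """Approximate the top left-singular vector of B by power iteration."""
    rng = np.random.default_rng(seed)
    u = rng.random(B.shape[0]) + 0.1  # strictly positive start vector
    for _ in range(iters):
        u = B @ (B.T @ u)             # one multiplication by B B^T
        norm = np.linalg.norm(u)
        if norm == 0.0:               # B is fully deflated (all zeros)
            return u
        u /= norm
    return u


def deflate(B, clustered_rows):
    """Apply the update B - I^S o B: zero out the rows that were clustered."""
    indicator = np.zeros_like(B)
    indicator[list(clustered_rows), :] = 1
    return B - indicator * B          # '*' is the Hadamard product


B0 = np.array([[1, 1, 1, 0, 0],
               [1, 1, 1, 0, 0],
               [0, 0, 0, 1, 1]], dtype=float)

u0 = top_left_singular_vector(B0)
S0 = [i for i in range(len(u0)) if abs(u0[i]) > 0.5]  # stand-in cluster rule
B1 = deflate(B0, S0)
u1 = top_left_singular_vector(B1)

print(S0)  # rows grouped in the first iteration
print(B1)  # those rows are now zero, so the next iteration finds a new cluster
```

Without the call to `deflate`, the second power iteration would return the same dominant direction as the first, which is the non-convergence issue noted above.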

Although this algorithm works assuming that we can remove a cluster in each iteration, the
question of how we find this cluster given a singular vector has yet to be answered. First, we
sort the singular vector, permuting the rows by magnitude of projection. The intuition is that
the structure (row) which projects most strongly to that cluster is the best representation of the
cluster, and is considered a base structure which we attempt to find matches for. Starting from
the base structure, we iterate down the sorted list and compute the Jaccard similarity, defined as
$J(L_1, L_2) = |L_1 \cap L_2| / |L_1 \cup L_2|$ for node-sets $L_1$ and $L_2$, between each structure and the base. Other structures
which are composed of the same, or similar, node-sets will also project strongly to the cluster, and
will be stitched to the base. Once we encounter a series of structures which fail to match by a
predefined similarity criterion, we adjust the SNMM and continue with the next iteration.
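
The matching procedure above can be sketched in a few lines. The sketch makes two assumptions that the text leaves open: failures are counted consecutively (the counter resets after a successful match), and the stopping rule is exposed as a `max_failures` parameter. The function names and the toy data are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two node-sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def extract_cluster(projection, node_sets, threshold=0.5, max_failures=10):
    """Find the rows to stitch together, given one singular vector.

    projection: entries of the singular vector, one per structure (row).
    node_sets:  node-set of each structure, in the same row order.
    Returns the row IDs of the base structure and its matches.
    """
    # Sort rows by magnitude of projection, strongest first.
    order = sorted(range(len(projection)),
                   key=lambda i: abs(projection[i]), reverse=True)
    base = order[0]
    cluster, failures = [base], 0
    for i in order[1:]:
        if jaccard(node_sets[base], node_sets[i]) >= threshold:
            cluster.append(i)
            failures = 0  # assumption: only consecutive failures count
        else:
            failures += 1
            if failures >= max_failures:
                break
    return cluster


projection = [0.70, 0.69, 0.05, 0.10]
node_sets = [{1, 2, 3}, {1, 2, 3, 4}, {7, 8}, {3, 9, 10}]
print(extract_cluster(projection, node_sets, max_failures=2))
```

Here row 0 is the base structure, row 1 matches it with Jaccard similarity 3/4, and rows 3 and 2 both fail, which ends the scan.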

Having stitched together the relevant static structures, we label each temporal structure using
the temporal signature in Δ and resulting phrase in Φ which minimizes its encoding cost using
the temporal encoding framework derived in Section 6.1.2. We use these temporal structures to
populate the candidate set C for our model.


6.2.4  Composing  the  Summary 

Given the candidate set of temporal structures C, we next seek to find the model M which best
summarizes G. However, actually finding the best model is combinatorial, as it involves considering
all possible permutations of subsets of C and choosing the one which gives the smallest encoding
cost. As a result, we propose several heuristics that give fast and approximate solutions without
entertaining the entire search space. As in the case of static graphs in Chapter 3, to reduce the
search space, we associate a metric with each temporal structure by which we measure quality,
called the local encoding benefit. The local encoding benefit is defined as the ratio between the
cost of encoding the given temporal structure as error, and the cost of encoding it using the best
phrase (local encoding cost). Large local encoding benefits indicate high compressibility, and thus
an easy-to-describe structure in the underlying data. We use the same heuristics we used in static
graph summarization (Chapter 3):

Plain: This is the baseline approach (reported as Vanilla in Section 6.3), in which our summary
contains all the structures from the candidate set, or M = C.

Top-k: In this approach, M consists of the top k structures of C, sorted by local encoding benefit.

Greedy’nForget: This approach involves considering each structure of C, sorted by local encoding
benefit, and adding it to M if the global encoding cost decreases. If adding the structure to M
increases the global encoding cost, the structure is discarded as redundant or not worthwhile for
summarization purposes.

In  practice,  TimeCrunch  uses  each  of  the  heuristics  and  identifies  the  best  summary  for  G  as 
the  one  that  produces  the  minimum  encoding  cost. 
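
The three heuristics and the final selection step can be sketched generically. The sketch treats the local encoding benefit and the total encoding cost as caller-supplied functions; the toy cost used in the example (a fixed number of bits per structure plus a fixed number of bits per uncovered edge) is purely hypothetical and is not the MDL encoding of Section 6.1. Structures that leave the cost unchanged are also discarded here, which is an assumption.

```python
def plain(candidates):
    """Plain: the model is the whole candidate set."""
    return list(candidates)


def top_k(candidates, benefit, k):
    """Top-k: the k candidates with the largest local encoding benefit."""
    return sorted(candidates, key=benefit, reverse=True)[:k]


def greedy_n_forget(candidates, benefit, total_cost):
    """Greedy'nForget: keep a structure only if it lowers the total cost."""
    model, best = [], total_cost([])
    for s in sorted(candidates, key=benefit, reverse=True):
        cost = total_cost(model + [s])
        if cost < best:
            model.append(s)
            best = cost
    return model


def best_summary(candidates, benefit, total_cost, ks=(10, 100)):
    """Run every heuristic and return the cheapest (name, model) pair."""
    options = {"Plain": plain(candidates),
               "Greedy'nForget": greedy_n_forget(candidates, benefit, total_cost)}
    for k in ks:
        options[f"Top-{k}"] = top_k(candidates, benefit, k)
    name = min(options, key=lambda n: total_cost(options[n]))
    return name, options[name]


# Toy setting (hypothetical costs): 10 edges, 4 bits per uncovered edge,
# and each structure is (name, covered_edges, bits_to_describe_it).
EDGES, ERROR_BITS = set(range(10)), 4
candidates = [("A", frozenset(range(0, 6)), 8), ("B", frozenset(range(0, 5)), 8),
              ("C", frozenset(range(6, 10)), 8), ("D", frozenset({9}), 8)]


def benefit(s):
    return ERROR_BITS * len(s[1]) / s[2]


def total_cost(model):
    covered = set().union(*(s[1] for s in model))
    return sum(s[2] for s in model) + ERROR_BITS * len(EDGES - covered)


name, model = best_summary(candidates, benefit, total_cost, ks=(2,))
print(name, [s[0] for s in model], total_cost(model))
```

In this toy case the redundant structures B and D are considered and dropped, so the greedy model is both smaller and cheaper than the Plain and Top-2 models.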


6.3  Experiments 

In this section, we evaluate TimeCrunch and seek to answer the following questions: Are real-world
dynamic graphs well-structured, or noisy and indescribable? If they are structured, what
temporal  structures  do  we  see  in  these  graphs  and  what  do  they  mean?  Lastly,  is  TimeCrunch 
scalable? 

6.3.1  Datasets  and  Experimental  Setup 

For  our  experiments,  we  use  5  real  dynamic  graph  datasets,  which  are  summarized  in  Table  6.3 
and  described  below. 

Enron:  The  Enron  e-mail  dataset  is  publicly  available.  It  contains  20  thousand  unique  links 
between  151  users  based  on  e-mail  correspondence  over  163  weeks  (May  1999  -  June  2002). 

Yahoo!  IM:  The  Yahoo-IM  dataset  is  publicly  available.  It  contains  2.1  million  sender-receiver 
pairs  between  100  thousand  users  over  5709  zip-codes  selected  from  the  Yahoo!  messenger 
network  over  4  weeks  starting  from  April  1st,  2008. 

Table 6.3: Dynamic graphs used for empirical analysis.

Graph            Nodes         Edges          Timesteps
Enron [SAG ]     151           20000          163 weeks
Yahoo-IM [ ih]   100000        2.1 million    4 weeks
Honeynet         372000        7.1 million    32 days
DBLP [DBL14]     1.3 million   15 million     25 years
Phonecall        6.3 million   36.3 million   31 days

Honeynet: The Honeynet dataset contains information about network attacks on honeypots (i.e.,
computers which are left intentionally vulnerable to attackers). It contains source IP, destination IP
and attack timestamps of 372 thousand (attacker and honeypot) machines with 7.1 million unique
daily attacks over a span of 32 days starting from December 31st, 2013.

DBLP: The dblp computer science bibliography is publicly available, and contains yearly
co-authorship information, indicating joint publication. We used a subset of DBLP spanning 25 years,
from  1990  to  2014,  with  1.3  million  authors  and  15  million  unique  author-author  collaborations 
over  the  years. 

Phonecall: The Phonecall dataset describes the who-calls-whom activity of 6.3 million individuals
from a large, anonymous Asian city and contains a total of 36.3 million unique daily phonecalls.
It  spans  31  days,  starting  from  December  1st,  2007. 

In our experiments, we use SlashBurn [KF1 ] for generating candidate static structures, as it
is scalable and designed to extract structure from real-world, non-caveman graphs². We note that,
thanks to our MDL approach, the inclusion of additional graph decomposition methods will only
improve the results. Furthermore, when clustering each sorted singular vector during the stitching
process, we move on to the next iteration of matrix deflation after 10 failed matches with a
Jaccard similarity threshold of 0.5 - we choose 0.5 based on experimental results which show that
it gives the best encoding cost and balances between excessively terse and overlong (error-prone)
models. Lastly, we run TimeCrunch for a total of 5000 iterations for all graphs (each iteration
uniformly selects one SNMM to mine, resulting in 5000 total temporal structures), except for the
Enron graph, which is fully deflated after 563 iterations, and the Phonecall graph, which we
limit to 1000 iterations for efficiency.


6.3.2  Quantitative  Analysis 

In this section, we use TimeCrunch to summarize each of the real-world dynamic graphs from
Table 6.3 and report the resulting encoding costs. Specifically, we evaluate the compression ratio,
i.e., the encoding cost of each resulting model relative to the null encoding (Original) cost,
which is obtained by encoding the graph using an empty model.

We  note  that  although  we  provide  results  in  a  compression  context,  as  in  the  case  of  static 
graph  summarization,  compression  is  not  our  main  goal  for  TimeCrunch,  but  rather  the  means  to 
our  end  for  identifying  suitable  structures  with  which  to  summarize  dynamic  graphs  and  route  the 
attention  of  practitioners.  For  this  reason,  we  do  not  evaluate  against  other,  compression-oriented 
methods  which  prioritize  leveraging  any  correlation  within  the  data  to  reduce  cost  and  save  bits. 
Other  temporal  clustering  and  community  detection  approaches  which  focus  only  on  extracting 
dense  blocks  are  also  not  compared  to  our  method  for  similar  reasons. 

2  A  caveman  graph  arises  by  modifying  a  set  of  fully  connected  clusters  (caves)  by  removing  one  edge  from 
each  cluster  and  using  it  to  connect  to  a  neighboring  one  such  that  the  clusters  form  a  single  loop  [Wat99]. 
Table 6.4: TimeCrunch finds temporal structures that can compress real graphs. Original denotes
the cost in bits for encoding each graph with an empty model. Columns under TimeCrunch show
relative costs for encoding the graphs using the respective heuristic (size of model is parenthesized).
The lowest description cost in each row is marked with *.

                               TimeCrunch
Graph       Original (bits)    Vanilla        Top-10   Top-100   Greedy’nForget
Enron       86,102             89% (563)      88%      81%       78% (130)*
Yahoo-IM    16,173,388         97% (5000)     99%      98%       93% (1523)*
Honeynet    72,081,235         82% (5000)     96%      89%       81% (3740)*
DBLP        167,831,004        97% (5000)     99%      99%       96% (1627)*
Phonecall   478,377,701        100% (1000)    100%     99%       98% (370)*


[Plot: Encoding Cost vs. Model Size; x-axis: Number of Structures in Model]


Figure 6.2: TimeCrunch-Greedy’nForget summarizes Enron using just 78% of Original’s bits and
130 structures compared to 89% and 563 structures of TimeCrunch-Vanilla by pruning unhelpful
structures from the candidate set.


In our evaluation, we consider (a) Original and (b) TimeCrunch summarization using the
proposed heuristics. In the Original approach, the entire adjacency tensor is encoded using
the empty model M = ∅. As the empty model does not describe any part of the graph, all the
edges are encoded using $L(E^-)$. We use this as a baseline to evaluate the savings attainable using
TimeCrunch. For summarization using TimeCrunch, we apply the Vanilla, Top-10, Top-100
and Greedy’nForget model selection heuristics. We note that we ignore very small structures of
fewer than 5 nodes for Enron and fewer than 8 nodes for the other, larger datasets.

Table  6.4  shows  the  results  of  our  experiments  in  terms  of  the  encoding  costs  of  various 
summarization  techniques  as  compared  to  the  Original  approach.  Smaller  compression  ratios 
indicate  better  summaries,  with  more  structure  explained  by  the  respective  models.  For  example, 
Greedy’nForget  was  able  to  encode  the  Enron  dataset  using  just  78%  of  the  bits  compared  to 
89%  using  Vanilla.  In  our  experiments,  we  find  that  the  Greedy’nForget  heuristic  produces 
models  with  considerably  fewer  structures  than  Vanilla,  while  giving  even  more  concise  graph 
summaries  (Figure  6.2).  This  is  because  it  is  highly  effective  in  pruning  redundant,  overlapping  or 
error-prone  structures  from  the  candidate  set  C,  by  evaluating  new  structures  in  the  context  of 
previously  seen  ones. 
Observation 10. Real-world dynamic graphs are structured. TimeCrunch gives a better
encoding cost than Original, indicating the presence of a temporal graph structure.


6.3.3  Qualitative  Analysis 

In  this  section,  we  discuss  qualitative  results  from  applying  TimeCrunch  to  the  graphs  mentioned 
in  Table  6.3. 


[Figure 6.3 panels; axes show Node IDs]

(a) 8 employees of the Enron legal team forming a flickering near clique.

(b) 10 employees of the Enron legal team forming a flickering star with the boss as the hub.

(c) 40 users in Yahoo-IM forming a constant near clique with 55% density over the observed 4 weeks.

(d) 82 users in Yahoo-IM forming a constant star over the observed 4 weeks.

(e) 589 honeypot machines were attacked on Honeynet over 2 weeks, forming a ranged star.

(f) 43 authors that publish together in biotechnology journals forming a ranged near clique on DBLP.

(g) 82 authors forming a ranged near clique on DBLP, with burgeoning collaboration from timesteps 18-20 (2007-2009).

(h) 111 callers in Phonecall forming a periodic star appearing strongly on odd numbered days, especially Dec. 25 and 31.

(i) 792 callers in Phonecall forming a oneshot near bipartite core appearing strongly on Dec. 31.

Figure 6.3: TimeCrunch finds meaningful temporal structures in real graphs. We show the reordered
subgraph adjacency matrices over multiple timesteps. Individual timesteps are outlined in
gray, and edges are plotted with alternating red and blue color for discernibility.


Enron:  The  Enron  graph  is  mainly  characterized  by  many  periodic,  ranged  and  oneshot  stars  and 
several  periodic  and  flickering  cliques.  Periodicity  is  reflective  of  office  e-mail  communications 
(e.g.,  meetings,  reminders).  Figure  6.3(a)  shows  an  excerpt  from  one  flickering  clique  which 
corresponds  to  several  members  of  Enron’s  legal  team,  including  Tana  Jones,  Susan  Bailey,  Marie 
Heard  and  Carol  Clair  -  all  lawyers  at  Enron.  Figure  6.3(b)  shows  an  excerpt  from  a  flickering  star, 
corresponding  to  many  of  the  same  members  as  the  flickering  clique  -  the  center  of  this  star  was 
identified  as  the  boss,  Tana  Jones  (Enron’s  Senior  Legal  Specialist)  -  note  the  vertical  points  above 
node  1  correspond  to  the  satellites  of  the  star  and  oscillate  over  time.  Interestingly,  the  flickering 
star and clique extend over most of the observed duration. Furthermore, several of the oneshot
stars correspond to company-wide emails sent out by key players John Lavorato (CEO of Enron
America), Sally Beck (COO) and Kenneth Lay (CEO/Chairman).

Yahoo!  IM:  The  Yahoo-IM  graph  is  composed  of  many  temporal  stars  and  cliques  of  all  types, 
and  several  smaller  bipartite  cores  with  just  a  few  members  on  one  side  (indicative  of  friends 
who  share  mostly  similar  friend-groups  but  are  themselves  unconnected).  We  observe  several 
interesting  patterns  in  this  data.  Figure  6.3(d)  corresponds  to  a  constant  star  with  a  hub  that 
communicates  with  70  users  consistently  over  4  weeks.  We  suspect  that  these  users  are  part  of 
a  small  office  network,  where  the  boss  uses  group  messaging  to  notify  employees  of  important 
updates  or  events  -  we  notice  that  very  few  edges  of  the  star  are  missing  each  week  and  the 
average  degree  of  the  satellites  is  roughly  4,  corresponding  to  possible  communication  between 
employees.  Figure  6.3(c)  depicts  a  constant  clique  between  40  users,  with  an  average  density 
over  55%  -  we  suspect  that  these  may  be  spam-bots  messaging  each  other  in  an  effort  to  appear 
normal,  or  a  large  group  of  friends  with  multiple  message  groups.  Due  to  lack  of  ground-truth,  we 
cannot  verify. 

Honeynet:  Honeynet  is  a  bipartite  graph  between  attacker  and  honeypot  (victim)  machines.  As 
such,  it  is  characterized  by  temporal  stars  and  bipartite  cores.  Many  of  the  attacks  only  span  a 
single  day,  as  indicated  by  the  presence  of  3512  oneshot  stars,  and  no  attacks  span  the  entire  32 
day  duration.  Interestingly,  2502  of  these  oneshot  star  attacks  (71%)  occur  on  the  first  and  second 
observed  days  (Dec.  31st  and  Jan.  1st)  indicating  intentional  “new-year”  attacks.  Figure  6.3(e) 
shows  a  ranged  star,  lasting  15  consecutive  days  and  targeting  589  machines  for  the  entire  duration 
of  the  attack  (node  1  is  the  hub  of  the  star  and  the  remainder  are  satellites) . 

DBLP:  Agreeing  with  intuition,  dblp  consists  of  a  large  number  of  oneshot  temporal  structures 
corresponding  to  many  single  instances  of  joint  publication.  However,  we  also  find  numerous 
ranged/periodic  stars  and  cliques  which  indicate  coauthors  publishing  in  consecutive  years  or 
intermittently.  Figure  6.3(f)  shows  a  ranged  clique  spanning  from  2007-2012  between  43  coauthors 
who jointly published each year. The authors are mostly members of the NIH NCBI (National
Institutes of Health, National Center for Biotechnology Information) and have published their work
in  various  biotechnology  journals,  such  as  Nature,  Nucleic  Acids  Research  and  Genome  Research. 
Figure  6.3(g)  shows  another  ranged  clique  from  2005  to  2011,  consisting  of  83  coauthors  who 
jointly  publish  each  year,  with  an  especially  collaborative  3  years  (timesteps  18-20)  corresponding 
to  2007-2009  before  returning  to  status  quo. 
Phonecall: The Phonecall dataset is composed largely of temporal stars and few dense clique
and bipartite structures. Again, we have a large proportion of oneshot stars which occur only
at  single  timesteps.  Further  analyzing  these  results,  we  find  that  111  of  the  187  oneshot  stars 
(59%)  are  found  on  Dec.  24th,  25th  and  31st,  corresponding  to  Christmas  Eve/Day  and  New 
Year’s  Eve  holiday  greetings.  Furthermore,  we  find  many  periodic  and  flickering  stars  typically 
consisting  of  50-150  nodes,  which  may  be  associated  with  businesses  regularly  contacting  their 
clientele,  or  public  phones  which  are  used  consistently  by  the  same  individuals.  Figure  6.3(h) 
shows  one  such  periodic  star  of  111  users  over  the  last  week  of  December,  with  particularly  clear 
star  structure  on  Dec.  25th  and  31st  and  other  odd-numbered  days,  accompanied  by  substantially 
weaker  star  structure  on  the  even-numbered  days.  Figure  6.3(i)  shows  an  oddly  well-separated 
oneshot  near-bipartite  core  which  appears  on  Dec.  31st,  consisting  of  two  roughly  equal-sized 
parts  of  402  and  390  callers.  Though  we  do  not  have  ground  truth  to  interpret  these  structures, 
we  note  that  a  practitioner  with  the  appropriate  information  could  better  interpret  their  meaning. 


6.3.4  Scalability 

All components of TimeCrunch (candidate subgraph generation, static subgraph labeling, temporal
stitching and summary composition) are carefully designed to be near-linear in the number
of edges. Figure 6.4 shows the O(m) runtime of TimeCrunch on several induced temporal
subgraphs (up to 14M edges) taken from the DBLP dataset at varying time-intervals. We ran
the experiments using a machine with 80 Intel Xeon(R) 4850 2GHz cores and 256GB RAM. We
use MATLAB for candidate subgraph generation and temporal stitching, and Python for model
selection heuristics.


[Plot: Runtime vs. Data Size; x-axis: Number of Edges (size of data)]


Figure  6.4:  TimeCrunch  scales  near-linearly  on  the  number  of  edges  in  the  graph.  Here,  we  use 
several  induced  temporal  subgraphs  from  DBLP,  up  to  14M  edges  in  size. 
6.4 Related Work


The  related  work  falls  into  three  main  categories:  static  graph  mining,  temporal  graph  mining, 
and  graph  compression  and  summarization.  Table  6.1  gives  a  visual  comparison  of  TimeCrunch 
with  existing  methods. 

Static Graph Mining. Most works find specific, tightly-knit structures, such as (near-) cliques
and bipartite cores: eigendecomposition [SBG ], cross-associations [ ], and modularity-based
optimization methods [ 'JGCM, 5GLL08]. Dhillon et al. [ DMM03] propose information-theoretic
co-clustering based on mutual information optimization. However, these approaches
have limited vocabularies and are unable to find other types of interesting structures, such as stars
or chains. [ , ] propose cut-based partitioning, whereas [ ] suggests spectral
partitioning using multiple eigenvectors - these schemes seek hard clustering of all nodes as
opposed to identifying communities, and are not usually parameter-free. Subdue [ 19 ] and
other fast frequent-subgraph mining algorithms [ ,+05] operate on labeled graphs. Our work
involves unlabeled graphs and lossless compression.

Temporal Graph Mining. [AP05] aims at change detection in streaming graphs using projected
clustering. This approach focuses on anomaly detection rather than finding recurrent
temporal patterns. GraphScope [SFPY07] and Com2 [APG+ lz ] use graph-search and PARAFAC
(or Canonical Polyadic - CP) tensor decomposition followed by MDL to find dense temporal cliques
and bipartite cores. [ L+01 ] uses incremental cross-association for change detection in dense
blocks over time, whereas [ ] proposes an algorithm for mining cross-graph quasi-cliques
(though not in a temporal context). A probabilistic approach based on mixed-membership
blockmodels is proposed by Fu et al. [FSX0( ]. These approaches have limited vocabularies and do not
offer temporal interpretability. Dynamic clustering [XKHI ] aims to find stable clusters over time
by penalizing deviations from incremental static clustering. Our work focuses on interpretable
structures, which may not appear at every timestep.

Graph Compression and Summarization. SlashBurn [KF1 ] is a recursive node-reordering
approach to leverage run-length encoding for graph compression. [TZ ] uses structural
equivalence to collapse nodes/edges to simplify graph representation. These approaches do not
compress the graph for pattern discovery, nor do they operate on dynamic graphs. VoG (Chapter 3,
[ , ]) uses MDL to label subgraphs in terms of a vocabulary on static graphs,
consisting of stars, (near) cliques, (near) bipartite cores and chains. This approach only applies to
static graphs and does not offer a clear extension to dynamic graphs. Our work proposes a suitable
lexicon for dynamic graphs, uses MDL to label temporally coherent subgraphs and proposes an
effective and scalable algorithm for finding them. For more extensive related work in this category,
refer also to Section 3.6.
6.5 Summary

In  this  work,  we  tackle  the  problem  of  identifying  significant  and  structurally  interpretable  temporal 
patterns  in  large,  dynamic  graphs.  Specifically, 

• Problem Formulation: We formalize the problem of finding important and coherent temporal
structures in a graph as minimizing the encoding cost of the graph from a compression
standpoint.

• Effective and Scalable Algorithm: We propose TimeCrunch, a fast and effective, incremental
technique for building interpretable summaries for dynamic graphs which involves
generating candidate subgraphs from each static graph, labeling them using static identifiers,
stitching them over multiple timesteps and composing a model using practical approaches.

• Experiments on Real Graphs: We apply TimeCrunch on several large, dynamic graphs
and find numerous patterns and anomalies which indicate that real-world graphs do in fact
exhibit temporal structure.
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Chapter  based  on  work  that  appeared  at  SDM  2013  [KVF13],  and  will  appear  at  TKDD  [KSV+15]. 


Chapter  7 

Graph  Similarity 


A  question  that  often  comes  up  when  studying  multiple  networks  is:  How  much  do  two  graphs 
or  networks  differ  in  terms  of  connectivity,  and  which  are  the  main  node  and  edge  culprits  for 
their  difference?  For  example,  how  much  has  a  network  changed  since  yesterday?  How  different 
is  the  wiring  of  Bob’s  brain  (a  left-handed  male)  and  Alice’s  brain  (a  right-handed  female),  and 
what  are  their  main  differences? 

Similarity or comparison of aligned graphs (i.e., with known node correspondence) is a core
task for sense-making: abnormal changes in network traffic may indicate a computer attack;
large differences in a who-calls-whom graph may reveal a national celebration, or a
telecommunication problem. Besides, network similarity can serve as a building block for
similarity-based classification [CGG+09] of graphs, and give insights into transfer learning, as well as
behavioral patterns: is the Facebook message graph similar to the Facebook wall-to-wall graph?
Tracking changes in networks over time, spotting anomalies and detecting events is a research
direction that has attracted much interest (e.g., [  1],  [  ],  [  ]).

Long in the purview of researchers, graph similarity is a well-studied problem, and
several approaches have been proposed to solve variations of it. However, graph
comparison with node/edge attribution remains an open problem, and its list of requirements
keeps growing over time: the exponential growth of graphs, both in number and size,
calls for methods that are not only accurate, but also scalable to graphs with billions of nodes.

In  this  chapter,  we  address  three  main  problems:  (i)  How  to  compare  two  networks  efficiently, 
(ii)  how  to  evaluate  the  degree  of  their  similarity,  and  (iii)  how  to  identify  the  culprit  nodes/edges 
responsible  for  the  differences.  Our  main  contributions  are  the  following: 

1.  Axioms  and  Properties:  We  formalize  the  axioms  and  properties  to  which  a  similarity 
measure  must  conform  (  Section  7.1). 

2.  Effective and Scalable Algorithm: We propose DeltaCon for measuring connectivity
differences between two graphs, and show that it is: (a) principled, conforming to all
the axioms presented in Section 7.1, (b) intuitive, giving similarity scores that agree with
common sense and can be easily explained, and (c) scalable, able to handle large-scale
graphs. We also introduce DeltaCon-Attr for change attribution between graphs.

3.  Experiments on Real Graphs: We report experiments on synthetic and real datasets, and
compare our similarity measure to six state-of-the-art methods that apply to our setting. We
also use DeltaCon for real-world applications, such as temporal anomaly detection and
graph clustering/classification. In Figure 7.1, DeltaCon is used to cluster brain graphs
corresponding to 114 individuals; the two big clusters, which differ in terms of connectivity,
correspond to people with significantly different levels of creativity. More details are given
in Section 7.5.

[Figure 7.1 panels: (a) Connectome: neural network of brain. (b) Dendrogram (Ward linkage)
representing the hierarchical clustering of the DeltaCon similarities between the 114 connectomes.
(c) Brain graph of a subject with high creativity index. (d) Brain graph of a subject with
low creativity index.]

Figure 7.1: DeltaCon-based clustering shows that artistic brains seem to have different wiring than
the rest. (a) Brain network (connectome). Different colors correspond to each of the 70 cortical
regions, whose centers are depicted by vertices. (b) Hierarchical clustering using the DeltaCon
similarities results in two clusters of connectomes. Elements in red correspond to mostly high
creativity score. (c)-(d) Brain graphs for subjects with high and low creativity index, respectively.
The low-CCI brain has fewer and lighter cross-hemisphere connections than the high-CCI brain.
Table 7.1: DeltaCon: Symbols and Definitions. Bold capital letters: matrices; bold lowercase letters:
vectors; plain font: scalars.

Symbol                   Description
-----------------------  ------------------------------------------------------------
$G$                      graph
$\mathcal{V}$, $n$       set of nodes, number of nodes
$\mathcal{E}$, $m$       set of edges, number of edges
$\mathrm{sim}(G_1,G_2)$  similarity between graphs $G_1$ and $G_2$
$d(G_1,G_2)$             distance between graphs $G_1$ and $G_2$
$\mathbf{I}$             $n \times n$ identity matrix
$\mathbf{A}$             $n \times n$ adjacency matrix with elements $a_{ij}$
$\mathbf{D}$             $n \times n$ diagonal degree matrix, $d_{ii} = \sum_j a_{ij}$
$\mathbf{L}$             $= \mathbf{D} - \mathbf{A}$, Laplacian matrix
$\mathbf{S}$             $n \times n$ matrix of final scores with elements $s_{ij}$
$\mathbf{S}'$            $n \times g$ reduced matrix of final scores
$\mathbf{e}_i$           $n \times 1$ unit vector with 1 in the $i$th element
$\mathbf{b}_{h0k}$       $n \times 1$ vector of seed scores for group $k$
$\mathbf{b}_{hi}$        $n \times 1$ vector of final affinity scores to node $i$
$g$                      number of groups (node partitions)
$\epsilon$               $= 1/(1 + \max_i(d_{ii}))$, positive constant ($< 1$)
                         encoding the influence between neighbors
DC0, DC                  DeltaCon0, DeltaCon
VEO                      Vertex/Edge Overlap
GED                      Graph Edit Distance [  ]
SS                       Signature Similarity [  ’DGM08]
$\lambda$-D Adj.         $\lambda$-distance on the adjacency $\mathbf{A}$
$\lambda$-D Lap          $\lambda$-distance on the Laplacian $\mathbf{L}$
$\lambda$-D N.L.         $\lambda$-distance on the normalized Laplacian $\mathcal{L}$

The chapter is organized as follows: Section 7.1 presents the intuition behind our main
method, DeltaCon, and the axioms and desired properties of a similarity measure; Sections 7.2 and 7.3
present the proposed algorithms for similarity computation and node/edge attribution, as well as
theoretical proofs for the axioms and properties; experiments on synthetic and large-scale real
networks are in Section 7.4; Section 7.5 presents three real-world applications; the related work
and the conclusions are given in Sections 7.6 and 7.7, respectively. Finally, Table 7.1 presents the
major symbols we use in the chapter and their definitions.

7.1  Proposed Method: Intuition of DeltaCon

How can we find the similarity in connectivity between two graphs or, more formally, how can we
solve the following problem?

Problem Definition 7. [DeltaConnectivity]

Given (a) two graphs, $G_1(\mathcal{V}, \mathcal{E}_1)$ and $G_2(\mathcal{V}, \mathcal{E}_2)$, with the same node set¹, $\mathcal{V}$, but different edge
sets $\mathcal{E}_1$ and $\mathcal{E}_2$, and (b) the node correspondence.

Find a similarity score, $\mathrm{sim}(G_1, G_2) \in [0, 1]$, between the input graphs. A similarity score of
0 means totally different graphs, while 1 means identical graphs.

The obvious way to solve this problem is by measuring the overlap of their edges. Why does this
often not work in practice? Consider the following example. According to the overlap method,
the pairs of barbell graphs shown in Figure 7.3 (p. 136), (B10, mB10) and (B10, mmB10), have
the same similarity score. But, clearly, from the aspect of information flow, a missing edge from a
clique (mB10) does not play as important a role in the graph connectivity as the missing “bridge” in
mmB10. So, could we instead measure the differences in the 1-step-away neighborhoods, the 2-step-away
neighborhoods, etc.? If yes, with what weight? It turns out that our method does that in a
principled way (Intuition 1, p. 120).
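The failure of edge overlap on the barbell example can be reproduced in a few lines. The sketch below is illustrative only: the Jaccard-style `edge_overlap` score and the node numbering (two 5-cliques on nodes 0-4 and 5-9, bridge (4, 5)) are assumptions made for this example, not definitions taken from the chapter.

```python
from itertools import combinations

def barbell(k):
    """Edge set of two k-cliques (nodes 0..k-1 and k..2k-1) joined by the bridge (k-1, k)."""
    left = set(combinations(range(k), 2))
    right = set(combinations(range(k, 2 * k), 2))
    return left | right | {(k - 1, k)}

def edge_overlap(e1, e2):
    """Assumed overlap score: shared edges divided by the union of edges."""
    return len(e1 & e2) / len(e1 | e2)

b10 = barbell(5)                    # 10 + 10 clique edges + 1 bridge = 21 edges
no_clique_edge = b10 - {(0, 1)}     # drop an edge inside the left clique
no_bridge = b10 - {(4, 5)}          # drop the bridge, disconnecting the graph

score_clique = edge_overlap(b10, no_clique_edge)
score_bridge = edge_overlap(b10, no_bridge)
print(score_clique, score_bridge)   # both are 20/21
```

Both modified graphs share 20 of the 21 original edges, so the overlap score cannot tell a harmless clique edge from the bridge whose removal disconnects the graph.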

7.1.1  Fundamental  Concept 

The first conceptual step of our proposed method is to compute the pairwise node affinities in the
first graph, and compare them with the ones in the second graph. For notational compactness, we
store them in an $n \times n$ similarity matrix² $\mathbf{S}$. The $s_{ij}$ entry of the matrix indicates the influence node
i has on node j. For example, in a who-knows-whom network, if node i is, say, republican and if
we assume homophily (i.e., neighbors are similar), how likely is it that node j is also republican?
Intuitively, node i has more influence/affinity to node j if there are many short, heavily weighted
paths from node i to j.

The  second  conceptual  step  is  to  measure  the  differences  in  the  corresponding  node  affinity 
scores  of  the  two  graphs  and  report  the  result  as  their  similarity  score. 

7.1.2  How  to  measure  node  affinity? 

Pagerank [BP98], personalized Random Walks with Restarts (RWR) [Hav03], lazy RWR [AF02],
and the “electrical network analogy” technique [DS84] are only a few of the methods that compute
node affinities. We could have used Personalized RWR: $[\mathbf{I} - (1-c)\mathbf{A}\mathbf{D}^{-1}]\,\mathbf{b}_{hi} = c\,\mathbf{e}_i$, where c is the
probability of restarting the random walk from the initial node, $\mathbf{e}_i$ the starting (seed) indicator
vector (all zeros except 1 at position i), and $\mathbf{b}_{hi}$ the unknown Personalized Pagerank column vector.
Specifically, $s_{ij}$ is the affinity of node j with respect to node i. For reasons that we explain next, we
chose to use Fast Belief Propagation (FaBP [KKK+11]), an inference method that we introduced
in Chapter 4. Specifically, we use a simplified form of FaBP given in the following lemma:

¹ If the graphs have different, but overlapping node sets, $\mathcal{V}_1$ and $\mathcal{V}_2$, we assume that $\mathcal{V} = \mathcal{V}_1 \cup \mathcal{V}_2$, and
the extra nodes are treated as singletons.

² In reality, we don’t measure all the affinities (see Section 7.2.2 for an efficient approximation).

Lemma 1.

FaBP (Equation 4.4) can be simplified and written as:

$$[\mathbf{I} + \epsilon^2\mathbf{D} - \epsilon\mathbf{A}] \cdot \mathbf{b}_{hi} = \mathbf{e}_i, \qquad (7.1)$$

where $\mathbf{b}_{hi} = [s_{i1}, \dots, s_{in}]^T$ is the column vector of final similarity/influence scores starting
from the ith node, $\epsilon$ is a small constant capturing the influence between neighboring
nodes, $\mathbf{I}$ is the identity matrix, $\mathbf{A}$ is the adjacency matrix and $\mathbf{D}$ is the diagonal matrix
with the degree of node i as the $d_{ii}$ entry.


Proof. (From FaBP to DeltaCon.) We start from the equation for FaBP in Chapter 4,

$$[\mathbf{I} + a\mathbf{D} - c'\mathbf{A}]\,\mathbf{b}_h = \boldsymbol{\phi}_h, \qquad (4.4)$$

where $\boldsymbol{\phi}_h$ is the vector of prior scores, $\mathbf{b}_h$ is the vector of final scores (beliefs), $a = 4h_h^2/(1 - 4h_h^2)$
and $c' = 2h_h/(1 - 4h_h^2)$ are small constants, and $h_h$ is a small constant that encodes the influence
between neighboring nodes (homophily factor). By using the Maclaurin approximation for division
in Table 4.4, we obtain

$$1/(1 - 4h_h^2) \approx 1 + 4h_h^2.$$

To obtain Equation 7.1, the core formula of DeltaCon, we substitute the latter approximation in
Equation 4.4, and also set $\boldsymbol{\phi}_h = \mathbf{e}_i$, $\mathbf{b}_h = \mathbf{b}_{hi}$, and $h_h = \epsilon/2$.  ■
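The substitution in the proof can be spelled out term by term. The lines below are a sketch under one assumption that the proof does not state explicitly: after applying the Maclaurin approximation, terms of order $\epsilon^3$ and higher are discarded.

```latex
\begin{align*}
h_h = \tfrac{\epsilon}{2} \;&\Rightarrow\; 4h_h^2 = \epsilon^2, \qquad 2h_h = \epsilon, \\
a = \frac{4h_h^2}{1 - 4h_h^2} &\approx 4h_h^2\,(1 + 4h_h^2) = \epsilon^2 + \epsilon^4 \approx \epsilon^2, \\
c' = \frac{2h_h}{1 - 4h_h^2} &\approx 2h_h\,(1 + 4h_h^2) = \epsilon + \epsilon^3 \approx \epsilon, \\
[\mathbf{I} + a\mathbf{D} - c'\mathbf{A}]\,\mathbf{b}_h = \boldsymbol{\phi}_h
\;&\longrightarrow\;
[\mathbf{I} + \epsilon^2\mathbf{D} - \epsilon\mathbf{A}]\,\mathbf{b}_{hi} = \mathbf{e}_i .
\end{align*}
```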

For an equivalent, more compact notation, we use the matrix form, and stack all the $\mathbf{b}_{hi}$ vectors
(i = 1, ..., n) into the $n \times n$ matrix $\mathbf{S}$. We can easily prove that

$$\mathbf{S} = [s_{ij}] = [\mathbf{I} + \epsilon^2\mathbf{D} - \epsilon\mathbf{A}]^{-1}. \qquad (7.2)$$


Equivalence  to  Personalized  RWR.  Before  we  move  on  with  the  intuition  behind  our  method, 
we  note  that  the  version  of  FaBP  that  we  use  (Equation  7.1)  is  identical  to  Personalized  RWR 
under  specific  conditions,  as  shown  in  Theorem  1. 

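Theorem 1.

The FaBP equation (Equation 7.1) can be written in a Personalized RWR-like form:

$$[\mathbf{I} - (1 - c'')\mathbf{A}_*\mathbf{D}^{-1}]\,\mathbf{b}_{hi} = c''\,\mathbf{y},$$

where $c'' = 1 - \epsilon$, $\mathbf{y} = \mathbf{A}_*\mathbf{D}^{-1}\mathbf{A}^{-1}\frac{1}{c''}\mathbf{e}_i$ and $\mathbf{A}_* = \mathbf{D}(\mathbf{I} + \epsilon^2\mathbf{D})^{-1}\mathbf{D}^{-1}\mathbf{A}\mathbf{D}$.

Proof. We begin from the derived FaBP equation (Equation 7.1) and do simple linear algebra
operations:

$$\begin{aligned}
[\mathbf{I} + \epsilon^2\mathbf{D} - \epsilon\mathbf{A}]\,\mathbf{b}_{hi} &= \mathbf{e}_i
  && (\times\ \mathbf{D}^{-1} \text{ from the left}) \\
[\mathbf{D}^{-1} + \epsilon^2\mathbf{I} - \epsilon\mathbf{D}^{-1}\mathbf{A}]\,\mathbf{b}_{hi} &= \mathbf{D}^{-1}\mathbf{e}_i
  && (\mathbf{F} = \mathbf{D}^{-1} + \epsilon^2\mathbf{I}) \\
[\mathbf{F} - \epsilon\mathbf{D}^{-1}\mathbf{A}]\,\mathbf{b}_{hi} &= \mathbf{D}^{-1}\mathbf{e}_i
  && (\times\ \mathbf{F}^{-1} \text{ from the left}) \\
[\mathbf{I} - \epsilon\mathbf{F}^{-1}\mathbf{D}^{-1}\mathbf{A}]\,\mathbf{b}_{hi} &= \mathbf{F}^{-1}\mathbf{D}^{-1}\mathbf{e}_i
  && (\mathbf{A}_* = \mathbf{F}^{-1}\mathbf{D}^{-1}\mathbf{A}\mathbf{D}) \\
[\mathbf{I} - \epsilon\mathbf{A}_*\mathbf{D}^{-1}]\,\mathbf{b}_{hi} &= (1 - \epsilon)\left(\mathbf{A}_*\mathbf{D}^{-1}\mathbf{A}^{-1}\tfrac{1}{1-\epsilon}\,\mathbf{e}_i\right). \quad ■
\end{aligned}$$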


7.1.3  Why  use  Belief  Propagation? 

The  reasons  we  choose  BP  and  its  fast  approximation  with  Equation  7.2  are:  (a)  it  is  based  on 
sound  theoretical  background  (maximum  likelihood  estimation  on  marginals),  (b)  it  is  fast  (linear 
on  the  number  of  edges),  and  (c)  it  agrees  with  intuition,  taking  into  account  not  only  direct 
neighbors,  but  also  2-,  3-,  and  k-step-away  neighbors,  with  decreasing  weight.  We  elaborate  on 
the  last  reason,  next: 

Intuition  1.  (Attenuating  Neighboring  Influence) 

By temporarily ignoring the echo cancellation term $\epsilon^2\mathbf{D}$ in Equation 7.2, we can expand the
matrix inversion and approximate the $n \times n$ matrix of pairwise affinities, $\mathbf{S}$, as

$$\mathbf{S} \approx [\mathbf{I} - \epsilon\mathbf{A}]^{-1} \approx \mathbf{I} + \epsilon\mathbf{A} + \epsilon^2\mathbf{A}^2 + \dots$$

As we said, our method captures the differences in the 1-step, 2-step, 3-step etc. neighborhoods
in a weighted way; differences in long paths have a smaller effect on the computation of the
similarity measure than differences in short paths. Recall that $\epsilon < 1$, and that $\mathbf{A}^k$ has information
about the k-step paths. Notice that this is just the intuition behind our method; we do not use this
simplified formula to find matrix $\mathbf{S}$.
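The attenuation can be observed numerically on a small example. The sketch below solves Equation 7.1 exactly (echo cancellation term included) for a 5-node path graph seeded at one end. The dense Gauss-Jordan solver `solve` and the helper `affinities_from` are illustrative names introduced here, and the choice of $\epsilon = 1/(1 + \text{max degree})$ follows Table 7.1.

```python
def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [M[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(n):
            if r != c and aug[r][c] != 0.0:
                f = aug[r][c] / aug[c][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def affinities_from(seed, n, edges):
    """Solve [I + eps^2 D - eps A] b = e_seed for an undirected, unweighted graph."""
    A = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = A[v][u] = 1.0
    deg = [sum(row) for row in A]
    eps = 1.0 / (1.0 + max(deg))
    M = [[(1.0 + eps * eps * deg[i] if i == j else 0.0) - eps * A[i][j]
          for j in range(n)] for i in range(n)]
    return solve(M, [float(j == seed) for j in range(n)])

path = [(0, 1), (1, 2), (2, 3), (3, 4)]   # 0 - 1 - 2 - 3 - 4
scores = affinities_from(0, 5, path)
for hops, s in enumerate(scores):
    print(f"{hops} hops from the seed: affinity {s:.5f}")
```

On this path the affinity shrinks with every additional hop from the seed, which is the weighting that Intuition 1 describes.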


7.1.4  Which  properties  should  a  similarity  measure  satisfy? 

Let $G_1(\mathcal{V}, \mathcal{E}_1)$ and $G_2(\mathcal{V}, \mathcal{E}_2)$ be two graphs, and $\mathrm{sim}(G_1, G_2) \in [0, 1]$ denote their similarity score.
Then, we want the similarity measure to obey the following axioms:

•  A1. Identity property: $\mathrm{sim}(G_1, G_1) = 1$

•  A2. Symmetric property: $\mathrm{sim}(G_1, G_2) = \mathrm{sim}(G_2, G_1)$

•  A3. Zero property: $\mathrm{sim}(G_1, G_2) \to 0$ for $n \to \infty$, where $G_1$ is the complete graph ($K_n$), and
$G_2$ is the empty graph (i.e., the edge sets are complementary).

Moreover,  the  measure  must  be: 

(a)  intuitive. It should satisfy the following desired properties:

P1.  [Edge Importance] For unweighted graphs, changes that create disconnected components
should be penalized more than changes that maintain the connectivity properties of the
graphs.

P2.  [Edge-“Submodularity”] For unweighted graphs, a specific change is more important in a
graph with few edges than in a much denser, but equally sized graph.

P3.  [Weight Awareness] In weighted graphs, the bigger the weight of the removed edge is,
the greater the impact on the similarity measure should be.

In Section 7.2.3 we formalize the properties and discuss theoretically whether our proposed
similarity measure satisfies them. Moreover, in Section 7.4 we introduce and discuss
an additional, informal, property:

IP.  [Focus Awareness] “Random” changes in graphs are less important than “targeted” changes
of the same extent.

(b)  scalable. The huge size of the generated graphs, as well as their abundance, require a
similarity measure that is computed fast and handles graphs with billions of nodes.


7.2  Proposed  Method:  Details  of  DeltaCon 


Now  that  we  have  described  the  high  level  ideas  behind  our  method,  we  move  on  to  the  details. 

7.2.1  Algorithm  Description 

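Let the graphs we compare be $G_1(\mathcal{V}, \mathcal{E}_1)$ and $G_2(\mathcal{V}, \mathcal{E}_2)$. If the graphs have different node sets, say
$\mathcal{V}_1$ and $\mathcal{V}_2$, we assume that $\mathcal{V} = \mathcal{V}_1 \cup \mathcal{V}_2$, where some nodes are disconnected.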

As  mentioned  before,  the  main  idea  behind  our  proposed  similarity  algorithm  is  to  compare 
the  node  affinities  in  the  given  graphs.  The  steps  of  our  similarity  method  are: 

Step 1  By Equation 7.2, we compute for each graph the $n \times n$ matrix of pairwise node affinity
scores ($\mathbf{S}_1$ and $\mathbf{S}_2$ for graphs $G_1$ and $G_2$, respectively).

Step 2  Among the various distance and similarity measures (e.g., Euclidean distance (ED),
cosine similarity, correlation) found in the literature, we use the root Euclidean distance (RootED,
a.k.a. Matusita distance)

$$d(G_1, G_2) = \mathrm{RootED}(\mathbf{S}_1, \mathbf{S}_2) = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\sqrt{s_{1,ij}} - \sqrt{s_{2,ij}}\right)^2}. \qquad (7.3)$$

We use the RootED distance for the following reasons:

1.  it is very similar to the Euclidean distance (ED), the only difference being the square root of
the pairwise similarities ($s_{ij}$),

2.  it usually gives better results, because it “boosts” the node affinities³ and, therefore, detects
even small changes in the graphs (other distance measures, including ED, suffer from high
similarity scores no matter how much the graphs differ), and

3.  it satisfies the desired properties P1-P3, as well as the informal property IP. As discussed in
Section 7.2.3, at least P1 is not satisfied by the ED.

Step 3  For interpretability, we convert the distance (d) to a similarity measure (sim) via the
formula $\mathrm{sim} = \frac{1}{1+d}$. The result is bounded to the interval [0, 1], as opposed to being unbounded
in $[0, \infty)$. Notice that the distance-to-similarity transformation does not change the ranking of results
in a nearest-neighbor query.

The straightforward algorithm, DeltaCon0 (Algorithm 7.4), is to compute all the $n^2$ affinity
scores of matrix $\mathbf{S}$ by simply using Equation 7.2. We can do the inversion using the Power Method
or any other efficient method.

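Algorithm 7.4  DeltaCon0

INPUT: edge files of $G_1(\mathcal{V}, \mathcal{E}_1)$ and $G_2(\mathcal{V}, \mathcal{E}_2)$
// $\mathcal{V} = \mathcal{V}_1 \cup \mathcal{V}_2$, if $\mathcal{V}_1$ and $\mathcal{V}_2$ are the graphs’ node sets

$\mathbf{S}_1 = [\mathbf{I} + \epsilon^2\mathbf{D}_1 - \epsilon\mathbf{A}_1]^{-1}$      // $s_{1,ij}$: affinity/influence of
$\mathbf{S}_2 = [\mathbf{I} + \epsilon^2\mathbf{D}_2 - \epsilon\mathbf{A}_2]^{-1}$      // node i to node j in $G_1$
$d(G_1, G_2) = \mathrm{RootED}(\mathbf{S}_1, \mathbf{S}_2)$

RETURN: $\mathrm{sim}(G_1, G_2) = \frac{1}{1 + d(G_1, G_2)}$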
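Algorithm 7.4 can be turned into a small runnable sketch for toy graphs. The code below is not the thesis implementation: it uses a dense Gauss-Jordan solver instead of the power method, assumes unweighted undirected graphs given as sets of (u, v) pairs on nodes 0..n-1, and computes $\epsilon$ separately for each graph from its own maximum degree (sharing one $\epsilon$ across both graphs would be an equally plausible reading of Table 7.1). All function names are illustrative.

```python
import math
from itertools import combinations

def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [M[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(n):
            if r != c and aug[r][c] != 0.0:
                f = aug[r][c] / aug[c][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def affinity_matrix(n, edges):
    """S = [I + eps^2 D - eps A]^-1 (Equation 7.2), returned as a list of rows."""
    A = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = A[v][u] = 1.0
    deg = [sum(row) for row in A]
    eps = 1.0 / (1.0 + max(deg))  # assumption: eps from this graph's own max degree
    M = [[(1.0 + eps * eps * deg[i] if i == j else 0.0) - eps * A[i][j]
          for j in range(n)] for i in range(n)]
    # Row i solves M b = e_i; S is symmetric, so rows and columns coincide.
    return [solve(M, [float(i == j) for j in range(n)]) for i in range(n)]

def root_ed(S1, S2):
    """Root Euclidean (Matusita) distance of Equation 7.3."""
    total = 0.0
    for row1, row2 in zip(S1, S2):
        for a, b in zip(row1, row2):
            # max(., 0) guards against tiny negative rounding errors
            total += (math.sqrt(max(a, 0.0)) - math.sqrt(max(b, 0.0))) ** 2
    return math.sqrt(total)

def deltacon0(n, edges1, edges2):
    return 1.0 / (1.0 + root_ed(affinity_matrix(n, edges1), affinity_matrix(n, edges2)))

k = 5  # barbell: cliques on nodes 0-4 and 5-9, bridge (4, 5)
b10 = set(combinations(range(k), 2)) | set(combinations(range(k, 2 * k), 2)) | {(k - 1, k)}
sim_same = deltacon0(10, b10, b10)
sim_clique = deltacon0(10, b10, b10 - {(0, 1)})   # remove a clique edge
sim_bridge = deltacon0(10, b10, b10 - {(4, 5)})   # remove the bridge
print(sim_same, sim_clique, sim_bridge)
```

Unlike plain edge overlap, this score separates the two single-edge removals: dropping the bridge yields a lower similarity than dropping an edge inside a clique.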


7.2.2  Speeding  up:  DeltaCon 

DeltaCon0 satisfies all the properties in Section 7.1, but it is quadratic ($n^2$ affinity scores $s_{ij}$ are
computed by using the power method for the inversion of a sparse matrix) and thus not scalable.
We present a faster, linear algorithm, DeltaCon (Algorithm 7.5), which approximates DeltaCon0
and differs in the first step. We still want each node to become a seed exactly once in order to
find the affinities of the rest of the nodes to it; but here we have multiple seeds at once, instead of
having one seed at a time. The idea is to randomly divide our node set into g groups, and compute
the affinity score of each node i to group k, thus requiring only $n \times g$ scores, which are stored in
the $n \times g$ matrix $\mathbf{S}'$ ($g \ll n$). Intuitively, instead of using the $n \times n$ affinity matrix $\mathbf{S}$, we add up
the scores of the columns that correspond to the nodes of a group, and obtain the $n \times g$ matrix
$\mathbf{S}'$ ($g \ll n$). The score $s'_{ik}$ is the affinity of node i to the kth group of nodes (k = 1, ..., g). The
following lemma gives the complexity of computing the node-group affinities.

³ The node affinities are in [0, 1], so the square root makes them bigger.

Lemma 2.

The time complexity of computing the reduced affinity matrix, $\mathbf{S}'$, is linear in the number
of edges.


Proof. We can compute the $n \times g$ “skinny” matrix $\mathbf{S}'$ quickly, by solving $[\mathbf{I} + \epsilon^2\mathbf{D} - \epsilon\mathbf{A}]\,\mathbf{S}' =
[\mathbf{b}_{h01} \dots \mathbf{b}_{h0g}]$, where $\mathbf{b}_{h0k} = \sum_{i \in \mathrm{group}_k} \mathbf{e}_i$ is the membership $n \times 1$ vector for group k (all 0s,
except 1s for members of the group). Solving this system is equivalent to solving for each group
k the linear system $[\mathbf{I} + \epsilon^2\mathbf{D} - \epsilon\mathbf{A}]\,\mathbf{b}'_{hk} = \mathbf{b}_{h0k}$, where $\mathbf{b}'_{hk}$ is the kth column of $\mathbf{S}'$. Using the power
method (Section 4.4), the linear system can be solved in time linear in the number of non-zeros of
the matrix $\epsilon\mathbf{A} - \epsilon^2\mathbf{D}$, which is equivalent to the number of edges, m, of the input graph G. Thus, the g
linear systems require $O(g \cdot m)$ time, which is still linear in the number of edges for a small constant g.
It is worth noting that the g linear systems can be solved in parallel, since there are no dependencies,
and then the overall time is simply $O(m)$.  ■
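To make the per-group cost concrete, the sketch below solves one group system with a plain fixed-point iteration, b ← seeds + (εA − ε²D) b, over adjacency lists. Section 4.4 is outside this excerpt, so this iteration is only an assumed stand-in for the power method referenced in the proof; the names `affinity_to_seeds` and `residual` are illustrative. Each sweep touches every adjacency entry once, so its cost is proportional to n + m.

```python
def affinity_to_seeds(adj, seeds, tol=1e-12, max_iter=10_000):
    """Iterate b <- seeds + (eps*A - eps^2*D) b until successive updates differ by < tol."""
    n = len(adj)
    eps = 1.0 / (1 + max(len(nb) for nb in adj))
    b = list(seeds)
    for _ in range(max_iter):
        new = [seeds[i] + eps * sum(b[j] for j in adj[i]) - eps * eps * len(adj[i]) * b[i]
               for i in range(n)]
        if max(abs(x - y) for x, y in zip(new, b)) < tol:
            return new
        b = new
    return b

def residual(adj, seeds, b):
    """Largest absolute entry of [I + eps^2 D - eps A] b - seeds."""
    eps = 1.0 / (1 + max(len(nb) for nb in adj))
    return max(abs((1 + eps * eps * len(adj[i])) * b[i]
                   - eps * sum(b[j] for j in adj[i]) - seeds[i])
               for i in range(len(adj)))

# 6-node cycle as adjacency lists; the seed group is {0, 1}.
cycle = [[(i - 1) % 6, (i + 1) % 6] for i in range(6)]
group_seeds = [1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
b_group = affinity_to_seeds(cycle, group_seeds)
print(b_group, residual(cycle, group_seeds, b_group))
```

With ε = 1/(1 + max degree), every row of εA − ε²D has absolute sum below 1, so the iteration converges for any unweighted graph.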

Thus, we compute g final scores per node, which denote its affinity to every group of seeds,
instead of every seed node that we had in Equation 7.2. With careful implementation, DeltaCon
is linear in the number of edges and groups g. As we show in Section 7.4.3, it takes ~160 sec, on
commodity hardware, for a 1.6-million-node graph. Once we have the reduced affinity matrices $\mathbf{S}'_1$
and $\mathbf{S}'_2$ of the two graphs, we use the RootED to find the similarity between the $n \times g$ matrices
of final scores, where $g \ll n$. The pseudocode of DeltaCon is given in Algorithm 7.5.


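Algorithm 7.5  DeltaCon

INPUT: edge files of $G_1(\mathcal{V}, \mathcal{E}_1)$ and $G_2(\mathcal{V}, \mathcal{E}_2)$, and
       g (groups: # of node partitions)

$\{\mathcal{V}_j\}_{j=1}^{g}$ = random_partition($\mathcal{V}$, g)                    // g groups

// estimate affinity vector of nodes i = 1, ..., n to group k
for k = 1 → g do
    $\boldsymbol{\phi}_{hk} = \sum_{i \in \mathcal{V}_k} \mathbf{e}_i$                    // seed vector of group k ($\mathbf{b}_{h0k}$ in Table 7.1)
    solve $[\mathbf{I} + \epsilon^2\mathbf{D}_1 - \epsilon\mathbf{A}_1]\,\mathbf{b}'_{h1k} = \boldsymbol{\phi}_{hk}$
    solve $[\mathbf{I} + \epsilon^2\mathbf{D}_2 - \epsilon\mathbf{A}_2]\,\mathbf{b}'_{h2k} = \boldsymbol{\phi}_{hk}$
end for

$\mathbf{S}'_1 = [\mathbf{b}'_{h11}\ \mathbf{b}'_{h12}\ \dots\ \mathbf{b}'_{h1g}]$;  $\mathbf{S}'_2 = [\mathbf{b}'_{h21}\ \mathbf{b}'_{h22}\ \dots\ \mathbf{b}'_{h2g}]$
// compare affinity matrices $\mathbf{S}'_1$ and $\mathbf{S}'_2$
$d(G_1, G_2) = \mathrm{RootED}(\mathbf{S}'_1, \mathbf{S}'_2)$

RETURN: $\mathrm{sim}(G_1, G_2) = \frac{1}{1 + d(G_1, G_2)}$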
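A compact, runnable rendering of Algorithm 7.5 for toy graphs might look as follows. It is a sketch under stated assumptions rather than the thesis code: the group systems are solved with dense Gauss-Jordan elimination (cubic, not the linear-time power method), the random partition deals shuffled nodes round-robin into g groups, ε is computed per graph, and the names `system_matrix` and `deltacon` and the `seed` parameter are introduced here for illustration.

```python
import math
import random

def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [M[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(n):
            if r != c and aug[r][c] != 0.0:
                f = aug[r][c] / aug[c][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def system_matrix(n, edges):
    """Dense I + eps^2 D - eps A with eps = 1 / (1 + max degree)."""
    A = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = A[v][u] = 1.0
    deg = [sum(row) for row in A]
    eps = 1.0 / (1.0 + max(deg))
    return [[(1.0 + eps * eps * deg[i] if i == j else 0.0) - eps * A[i][j]
             for j in range(n)] for i in range(n)]

def deltacon(n, edges1, edges2, g, seed=0):
    """Group-based similarity: one linear solve per group and per graph."""
    nodes = list(range(n))
    random.Random(seed).shuffle(nodes)          # random partition, same for both graphs
    groups = [nodes[k::g] for k in range(g)]
    M1, M2 = system_matrix(n, edges1), system_matrix(n, edges2)
    d_sq = 0.0
    for group in groups:
        s0 = [0.0] * n
        for i in group:
            s0[i] = 1.0                         # seed (membership) vector of the group
        b1, b2 = solve(M1, s0), solve(M2, s0)   # one column of S'_1 and of S'_2
        d_sq += sum((math.sqrt(max(x, 0.0)) - math.sqrt(max(y, 0.0))) ** 2
                    for x, y in zip(b1, b2))
    return 1.0 / (1.0 + math.sqrt(d_sq))

ring = {(i, (i + 1) % 8) for i in range(8)}     # 8-node cycle
chorded = ring | {(0, 4)}                       # same cycle plus one chord
sim_identical = deltacon(8, ring, ring, g=3)
sim_changed = deltacon(8, ring, chorded, g=3)
print(sim_identical, sim_changed)
```

Using the same partition for both graphs is what makes the two reduced matrices comparable column by column.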
In an attempt to see how our random node partitioning algorithm in the first step fares with
respect to more principled partitioning techniques, we used METIS [KK95]. Essentially, such an
approach finds the influence of coherent subgraphs on the rest of the nodes in the graph, instead
of the influence of randomly chosen nodes on the latter. We found that the METIS-based variant of
our similarity method gave intuitive results for most small, synthetic graphs, but not for the real
graphs. This is probably related to the lack of good edge-cuts on sparse real graphs, and also to the
fact that changes within a group manifest less when the group consists of nodes belonging to a
single community than when it consists of randomly assigned nodes.

Next we give the time complexity of DeltaCon, as well as the relationship between the
similarity scores of DeltaCon0 and DeltaCon.

Lemma 3.

The time complexity of DeltaCon, when applied in parallel to the input graphs, is linear
in the number of edges in the graphs, i.e., $O(g \cdot \max\{m_1, m_2\})$.


Proof. By using the power method (Section 4.4), the complexity of solving Equation 7.1 is $O(m_i)$
for each graph (i = 1, 2). The node partitioning needs $O(n)$ time; the affinity algorithm is run g
times in each graph, and the similarity score is computed in $O(gn)$ time. Therefore, the complexity
of DeltaCon is $O((g + 1)n + g(m_1 + m_2))$, where g is a small constant. Unless the graphs are trees,
$|\mathcal{E}_i| \geq n$, so the complexity of the algorithm reduces to $O(g(m_1 + m_2))$. Assuming that the affinity
algorithm is run on the graphs in parallel, since there is no dependency between the computations,
DeltaCon has complexity $O(g \cdot \max\{m_1, m_2\})$.  ■

Before  we  give  the  relationship  between  the  similarity  scores  computed  by  the  two  proposed 
methods,  we  introduce  a  helpful  lemma. 

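Lemma 4.

The affinity score of each node to a group (computed by DeltaCon) is equal to the
sum of the affinity scores of the node to each one of the nodes in the group individually
(computed by DeltaCon0).


Proof. Let $\mathbf{B} = \mathbf{I} + \epsilon^2\mathbf{D} - \epsilon\mathbf{A}$. Then DeltaCon0 consists of solving for every node $i \in \mathcal{V}$ the
equation $\mathbf{B} \cdot \mathbf{b}_{hi} = \mathbf{e}_i$; DeltaCon solves the equation $\mathbf{B} \cdot \mathbf{b}'_{hk} = \boldsymbol{\phi}_{hk}$ for all groups $k \in (0, g]$,
where $\boldsymbol{\phi}_{hk} = \sum_{i \in \mathrm{group}_k} \mathbf{e}_i$ (the seed vector denoted $\mathbf{b}_{h0k}$ in Table 7.1). Because of the linearity of
matrix additions, it holds true that $\mathbf{b}'_{hk} = \sum_{i \in \mathrm{group}_k} \mathbf{b}_{hi}$, for all groups k.  ■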

Theorem 2.

DeltaCon’s similarity score between any two graphs $G_1$, $G_2$ upper bounds the actual
DeltaCon0’s similarity score, i.e., $\mathrm{sim}_{DC_0}(G_1, G_2) \leq \mathrm{sim}_{DC}(G_1, G_2)$.

Proof. Intuitively, grouping nodes blurs the influence information and makes the nodes seem more
similar than originally.

More formally, let $\mathbf{S}_1$, $\mathbf{S}_2$ be the $n \times n$ final score matrices of $G_1$ and $G_2$ by applying DeltaCon0,
and $\mathbf{S}'_1$, $\mathbf{S}'_2$ be the respective $n \times g$ final score matrices by applying DeltaCon. We want to show
that DeltaCon0’s distance

$$d_{DC_0} = \sqrt{\sum_{i=1}^{n}\sum_{j=1}^{n}\left(\sqrt{s_{1,ij}} - \sqrt{s_{2,ij}}\right)^2}$$

is greater than DeltaCon’s distance

$$d_{DC} = \sqrt{\sum_{k=1}^{g}\sum_{i=1}^{n}\left(\sqrt{s'_{1,ik}} - \sqrt{s'_{2,ik}}\right)^2}$$

or, equivalently, that $d^2_{DC_0} \geq d^2_{DC}$. It is sufficient to show that for one group of DeltaCon, the
corresponding summands in $d_{DC}$ are smaller than the summands in $d_{DC_0}$ which are related to
the nodes that belong to the group. By extracting the terms in the squared distances that refer
to one group of DeltaCon and its member nodes in DeltaCon0, and by applying Lemma 4, we
obtain the following terms:

$$t_{DC_0} = \sum_{i=1}^{n}\sum_{j \in \mathrm{group}}\left(\sqrt{s_{1,ij}} - \sqrt{s_{2,ij}}\right)^2$$

$$t_{DC} = \sum_{i=1}^{n}\left(\sqrt{\sum_{j \in \mathrm{group}} s_{1,ij}} - \sqrt{\sum_{j \in \mathrm{group}} s_{2,ij}}\right)^2.$$

Next we concentrate again on a selection of summands (e.g., i = 1), we expand the squares and
use the Cauchy-Schwarz inequality to show that

$$\sum_{j \in \mathrm{group}} \sqrt{s_{1,ij}\, s_{2,ij}} \leq \sqrt{\sum_{j \in \mathrm{group}} s_{1,ij} \sum_{j \in \mathrm{group}} s_{2,ij}}$$

or, equivalently, that $t_{DC_0} \geq t_{DC}$.  ■
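Lemma 4 and Theorem 2 can both be checked numerically on a small pair of graphs. The sketch below is only an illustration: the 8-node example graphs, the fixed partition into three groups and all function names are assumptions, and the linear systems are solved with dense Gauss-Jordan elimination.

```python
import math

def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [M[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(n):
            if r != c and aug[r][c] != 0.0:
                f = aug[r][c] / aug[c][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def system_matrix(n, edges):
    """Dense I + eps^2 D - eps A with eps = 1 / (1 + max degree)."""
    A = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = A[v][u] = 1.0
    deg = [sum(row) for row in A]
    eps = 1.0 / (1.0 + max(deg))
    return [[(1.0 + eps * eps * deg[i] if i == j else 0.0) - eps * A[i][j]
             for j in range(n)] for i in range(n)]

def compare_distances(n, edges1, edges2, groups):
    """Return (d_DC0, d_DC, largest deviation from Lemma 4 on graph 1)."""
    M1, M2 = system_matrix(n, edges1), system_matrix(n, edges2)
    S1 = [solve(M1, [float(i == j) for j in range(n)]) for i in range(n)]  # S1[i] = b_hi
    S2 = [solve(M2, [float(i == j) for j in range(n)]) for i in range(n)]
    rt = lambda x: math.sqrt(max(x, 0.0))
    d0_sq = sum((rt(a) - rt(b)) ** 2 for r1, r2 in zip(S1, S2) for a, b in zip(r1, r2))
    d_sq, gap = 0.0, 0.0
    for group in groups:
        direct = solve(M1, [float(i in group) for i in range(n)])    # one group solve
        summed1 = [sum(S1[j][i] for j in group) for i in range(n)]   # Lemma 4: sum of b_hj
        summed2 = [sum(S2[j][i] for j in group) for i in range(n)]
        gap = max(gap, max(abs(x - y) for x, y in zip(direct, summed1)))
        d_sq += sum((rt(x) - rt(y)) ** 2 for x, y in zip(summed1, summed2))
    return math.sqrt(d0_sq), math.sqrt(d_sq), gap

g1 = {(i, (i + 1) % 8) for i in range(8)} | {(0, 4), (2, 6)}   # cycle with two chords
g2 = g1 - {(2, 6)}                                             # one chord removed
d_dc0, d_dc, lemma4_gap = compare_distances(8, g1, g2, [[0, 3, 5], [1, 2], [4, 6, 7]])
print(d_dc0, d_dc, lemma4_gap)
```

A smaller grouped distance translates into a larger similarity after the 1/(1 + d) transformation, which is the direction of the bound in Theorem 2.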

7.2.3  Properties  of  DeltaCon 

We  have  already  presented  our  scalable  algorithm  for  graph  similarity  in  detail,  and  the  only 
question  that  remains  from  a  theoretic  perspective  is  whether  DeltaCon  satisfies  the  axioms  and 
properties  presented  in  Section  7.1.4. 

A1. Identity Property: $\mathrm{sim}(G_1, G_1) = 1$.

The affinity scores, $\mathbf{S}$, are identical for the input graphs, because the two linear systems in
Algorithm 7.4 are exactly the same. Thus, the RootED distance d is 0, and the DeltaCon
similarity, $\mathrm{sim} = \frac{1}{1+d}$, is equal to 1.

A2. Symmetric Property: $\mathrm{sim}(G_1, G_2) = \mathrm{sim}(G_2, G_1)$.

Similarly, the equations that are used to compute $\mathrm{sim}(G_1, G_2)$ and $\mathrm{sim}(G_2, G_1)$ are the same. The
only difference is the order of solving them in Algorithm 7.4. Therefore, both the RootED distance
d and the DeltaCon similarity score sim are the same.

A3. Zero Property: $\mathrm{sim}(G_1, G_2) \to 0$ for $n \to \infty$, where $G_1$ is the complete graph ($K_n$), and $G_2$ is
the empty graph (i.e., the edge sets are complementary).

Proof. First we show that all the nodes in a complete graph get final scores in $\{s_g, s_{ng}\}$, depending
on whether they are included in group g or not. Then, it can be demonstrated that the scores
have finite limits, and specifically $\{s_g, s_{ng}\} \to \{\frac{n}{2g} + 1, \frac{n}{2g}\}$ as $n \to \infty$ (for finite $\frac{n}{g}$). Given this
condition, it can be derived that the RootED, $d(G_1, G_2)$, between the $\mathbf{S}$ matrices of the empty and
the complete graph becomes arbitrarily large. So, $\mathrm{sim}(G_1, G_2) = \frac{1}{1 + d(G_1, G_2)} \to 0$ for $n \to \infty$.  ■
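The behaviour described in this proof can be observed numerically. The sketch below compares the complete graph with the empty graph for a fixed group size p = n/g. It assumes groups of p consecutive nodes (harmless here, since the complete graph is symmetric), uses ε = 1/n for $K_n$, and relies on the fact that for the empty graph the system matrix is the identity, so each group's affinity vector equals its seed vector. The function name and the chosen sizes are illustrative.

```python
import math

def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [M[i][:] + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(n):
            if r != c and aug[r][c] != 0.0:
                f = aug[r][c] / aug[c][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def complete_vs_empty(n, p):
    """Return (s_g, s_ng, sim) for K_n versus the empty graph; p must divide n, p < n."""
    eps = 1.0 / n                       # 1 / (1 + max degree), max degree = n - 1
    M = [[1.0 + eps * eps * (n - 1) if i == j else -eps for j in range(n)]
         for i in range(n)]
    d_sq = 0.0
    for start in range(0, n, p):
        seeds = [1.0 if start <= i < start + p else 0.0 for i in range(n)]
        b = solve(M, seeds)             # group affinities in the complete graph
        # empty graph: I b = seeds, so its group affinities are the seeds themselves
        d_sq += sum((math.sqrt(max(x, 0.0)) - math.sqrt(s)) ** 2 for x, s in zip(b, seeds))
    # b now belongs to the last group: node n-1 is a member, node 0 is not
    return b[n - 1], b[0], 1.0 / (1.0 + math.sqrt(d_sq))

results = {n: complete_vs_empty(n, 5) for n in (10, 20, 40)}
for n, (s_g, s_ng, sim) in results.items():
    print(f"n={n:3d}  s_g={s_g:.4f}  s_ng={s_ng:.4f}  sim={sim:.4f}")
```

With p = 5 the two scores move towards p/2 + 1 = 3.5 and p/2 = 2.5 as n grows, while the similarity keeps decreasing.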

DeltaCon satisfies the three axioms that every similarity measure must obey. We elaborate next on
whether it satisfies properties P1-P3.


P1. [Edge Importance] For unweighted graphs, changes that create disconnected components
should be penalized more than changes that maintain the connectivity properties of the graphs.

Formalizing  this  property  in  its  most  general  case  with  any  type  of  disconnected  components  is 
hard,  thus  we  focus  on  a  well-understood  and  intuitive  case:  the  barbell  graph. 


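Proof. Assume that $\mathbf{A}$ is the adjacency matrix of an undirected barbell graph with two cliques of
size $n_1$ and $n_2$, respectively (e.g., B10 with $n_1 = n_2 = 5$ in Figure 7.3) and $(i_0, j_0)$ is the “bridge”
edge. Without loss of generality we can assume that $\mathbf{A}$ has a block-diagonal form, with one edge
$(i_0, j_0)$ linking the two blocks:

$$a_{ij} = \begin{cases}
1 & \text{for } i, j \in \{1, \dots, n_1\} \text{ and } i \neq j, \\
  & \text{or } i, j \in \{n_1 + 1, \dots, n_1 + n_2\} \text{ and } i \neq j, \\
  & \text{or } (i, j) = (i_0, j_0) \text{ or } (i, j) = (j_0, i_0), \\
0 & \text{otherwise.}
\end{cases}$$

Then, $\mathbf{B}$ is an adjacency matrix with elements

$$b_{ij} = \begin{cases}
0 & \text{for one pair } (i, j) \neq (i_0, j_0) \text{ and } (j_0, i_0), \\
a_{ij} & \text{otherwise,}
\end{cases}$$

and $\mathbf{C}$ is an adjacency matrix with elements

$$c_{ij} = \begin{cases}
0 & \text{if } (i, j) = (i_0, j_0) \text{ or } (i, j) = (j_0, i_0), \\
a_{ij} & \text{otherwise.}
\end{cases}$$

We want to prove that $\mathrm{sim}(\mathbf{A}, \mathbf{B}) \geq \mathrm{sim}(\mathbf{A}, \mathbf{C}) \Leftrightarrow d(\mathbf{A}, \mathbf{B}) \leq d(\mathbf{A}, \mathbf{C})$ or, equivalently, that
$d^2(\mathbf{A}, \mathbf{B}) \leq d^2(\mathbf{A}, \mathbf{C})$.

From Equation 7.1, by expressing the matrix inversion using a power series and ignoring the
terms of greater than second power, for matrix $\mathbf{A}$ we obtain the solution:

$$\mathbf{b}_{hi} = [\mathbf{I} + (\epsilon\mathbf{A} - \epsilon^2\mathbf{D}_A) + (\epsilon\mathbf{A} - \epsilon^2\mathbf{D}_A)^2 + \dots]\,\mathbf{e}_i \;\Rightarrow\; \mathbf{S}_A = \mathbf{I} + \epsilon\mathbf{A} + \epsilon^2\mathbf{A}^2 - \epsilon^2\mathbf{D}_A,$$

where the index (e.g., A) denotes the graph each matrix corresponds to. Now, using the last
equation, we can write out the elements of the $\mathbf{S}_A$, $\mathbf{S}_B$, $\mathbf{S}_C$ matrices, and derive their RootED
distances:

$$d^2(\mathbf{A}, \mathbf{B}) = 4(n_2 - f)\frac{\epsilon^4}{c_1^2} + 2\frac{\epsilon^2}{c_2^2}$$

$$d^2(\mathbf{A}, \mathbf{C}) = 2(n_1 + n_2 - 2)\epsilon^2 + 2\epsilon,$$

where $c_1 = \sqrt{\epsilon + \epsilon^2(n_2 - 3)} + \sqrt{\epsilon + \epsilon^2(n_2 - 2)}$ and $c_2 = \sqrt{\epsilon^2(n_2 - 2)} + \sqrt{\epsilon + \epsilon^2(n_2 - 2)}$, and
f = 3 if the missing edge in graph $\mathbf{B}$ and the “bridge” edge are incident to the same node, or f = 2
in any other case.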

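We can, therefore, write the difference of the distances we are interested in as

$$d^2(\mathbf{A}, \mathbf{C}) - d^2(\mathbf{A}, \mathbf{B}) = 2\epsilon\left(\epsilon(n_1 + n_2 - 2) + 1 - \frac{2\epsilon^3(n_2 - f)}{c_1^2} - \frac{\epsilon}{c_2^2}\right).$$

By observing that $c_1 \geq 2\sqrt{\epsilon}$ and $c_2 \geq \sqrt{\epsilon}$ for $n_2 \geq 3$, and using these facts in the last
equation, it follows that:

$$d^2(\mathbf{A}, \mathbf{C}) - d^2(\mathbf{A}, \mathbf{B}) \geq 2\epsilon\left(\epsilon(n_1 + n_2 - 2) - \frac{\epsilon^2(n_2 - f)}{2}\right) = 2\epsilon^2\left(n_1 + n_2 - 2 - \frac{\epsilon(n_2 - f)}{2}\right).$$

Given that $n_2 \geq 3$ and f = 2 or 3, we obtain that $0 \leq n_2 - f \leq n_2 - 2$. Moreover, $0 < \epsilon < 1$ by definition,
so $\epsilon(n_2 - f)/2 \leq n_2 - 2$ and the last expression is at least $2\epsilon^2 n_1 > 0$. From these inequalities, it
immediately follows that $d^2(\mathbf{A}, \mathbf{B}) \leq d^2(\mathbf{A}, \mathbf{C})$. We
note that this property is not always satisfied by the Euclidean distance.  ■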


P2.  [Edge-“Submodularity”]  For  unweighted  graphs,  a  specific  change  is  more  important  in 
a  graph  with  few  edges  than  in  a  much  denser,  but  equally  sized  graph. 


Proof. Let A be the adjacency matrix of an undirected graph, with $m_A$ non-zero elements $a_{ij}$ and
$a_{i_0 j_0} = 1$, and B be the adjacency matrix of another graph which is identical to A, but is missing
the edge $(i_0, j_0)$:

\[
b_{ij} =
\begin{cases}
0 & \text{if } (i, j) = (i_0, j_0) \text{ or } (i, j) = (j_0, i_0) \\
a_{ij} & \text{otherwise}
\end{cases}
\]

Let us also assume another pair of graphs C and E⁴, defined as follows:

\[
c_{ij} =
\begin{cases}
0 & \text{for} \geq 1 \text{ pair } (i, j) \neq (i_0, j_0) \text{ and } (i, j) \neq (j_0, i_0) \\
a_{ij} & \text{otherwise}
\end{cases}
\]

that is, C has $m_C < m_A$ non-zero elements, and

\[
e_{ij} =
\begin{cases}
0 & \text{for } (i, j) = (i_0, j_0) \text{ or } (i, j) = (j_0, i_0) \\
c_{ij} & \text{otherwise}
\end{cases}
\]

⁴ We use E instead of D to distinguish the adjacency matrix of the graph from the diagonal matrix of
degrees, which is normally defined as D.


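We want to show that $sim(A, B) \geq sim(C, E) \Leftrightarrow d(A, B) \leq d(C, E)$. By substituting the
RootED distance, it turns out that we want to show that

\[
\sqrt{\sum_{i=1}^{n} \sum_{j=1}^{n} \left(\sqrt{s_{A,ij}} - \sqrt{s_{B,ij}}\right)^2}
\leq
\sqrt{\sum_{i=1}^{n} \sum_{j=1}^{n} \left(\sqrt{s_{C,ij}} - \sqrt{s_{E,ij}}\right)^2},
\]

where $s_{ij}$ are the elements of the corresponding affinity matrix S. These are defined for A by
expressing the matrix inversion in Equation 7.1 using a power series and ignoring the terms of
greater than second power:

\[
\vec{b}_{hi} = [I + (\epsilon A - \epsilon^2 D_A) + (\epsilon A - \epsilon^2 D_A)^2 + \dots]\,\vec{e}_i
\;\Rightarrow\; S_A = I + \epsilon A + \epsilon^2 A^2 - \epsilon^2 D_A,
\]

where the index (e.g., A) denotes the graph each matrix corresponds to.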

Instead of the proof, we show a representative set of simulations that suggest that the
submodularity property is satisfied by DeltaCon. Specifically, we start from a complete graph of
size n, $G_0 = K_n$, and randomly pick an edge $(i_0, j_0)$. Then, we generate a series of graphs, $G_t$,
derived from $G_{t-1}$ by removing one new edge $(i_t, j_t)$. Note that $(i_t, j_t)$ cannot be the initially chosen
edge $(i_0, j_0)$. For every derived graph, we compute the RootED distance between itself and the
same graph without the edge $(i_0, j_0)$. What we expect to see is that the distance decreases as the
number of edges in the graph increases. In other words, the distance between a sparse graph and
the same graph without $(i_0, j_0)$ is bigger than the distance between a denser graph and the same
graph missing the edge $(i_0, j_0)$. The opposite holds for the DeltaCon similarity measure, but in
the proofs we use distance since that function is mathematically easier to manipulate than the
similarity function. In Figure 7.2, we plot, for different graph sizes n, the RootED distance as a
function of the edges in the graph (from left to right the graph becomes denser, and tends to the
complete graph $K_n$). ■
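
The simulation described above can be sketched in a few lines of Python. This is only an
illustrative sketch, not the Matlab code used for Figure 7.2: the helper names (`affinity`,
`rooted`, `submodularity_curve`) are hypothetical, the affinity matrix is computed by exact
inversion of $I + \epsilon^2 D - \epsilon A$, and $\epsilon$ is assumed to be fixed at $1/(1 + \text{max degree of } K_n)$ for the
whole run.

```python
import numpy as np


def affinity(A, eps):
    """Affinity matrix S = (I + eps^2 D - eps A)^(-1) for adjacency matrix A."""
    n = A.shape[0]
    D = np.diag(A.sum(axis=1))
    return np.linalg.inv(np.eye(n) + eps**2 * D - eps * A)


def rooted(S1, S2):
    """RootED distance: Euclidean distance between element-wise square roots."""
    # Clipping guards against tiny negative round-off before the square root.
    r1 = np.sqrt(np.clip(S1, 0.0, None))
    r2 = np.sqrt(np.clip(S2, 0.0, None))
    return float(np.sqrt(((r1 - r2) ** 2).sum()))


def submodularity_curve(n=12, removals=40, seed=0):
    """Return (edge count, RootED distance) pairs while G_t becomes sparser."""
    rng = np.random.default_rng(seed)
    A = np.ones((n, n)) - np.eye(n)      # G_0 = K_n
    eps = 1.0 / n                        # assumption: 1 / (1 + max degree of K_n)
    i0, j0 = 0, 1                        # tracked edge (i0, j0); never removed from G_t
    curve = []
    for _ in range(removals + 1):
        B = A.copy()
        B[i0, j0] = B[j0, i0] = 0.0      # the same graph without (i0, j0)
        dist = rooted(affinity(A, eps), affinity(B, eps))
        curve.append((int(A.sum()) // 2, dist))
        candidates = [(i, j) for i in range(n) for j in range(i + 1, n)
                      if A[i, j] == 1.0 and (i, j) != (i0, j0)]
        if not candidates:
            break
        i, j = candidates[rng.integers(len(candidates))]
        A[i, j] = A[j, i] = 0.0          # G_{t+1}: remove one new edge
    return curve


curve = submodularity_curve()
```

Plotting the second component of `curve` against the first gives the kind of curve that
Figure 7.2 reports; whether it is monotone for a given seed and size has to be checked empirically.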

P3.  [Weight  Awareness]  In  weighted  graphs,  the  bigger  the  weight  of  the  removed  edge  is, 
the  greater  the  impact  on  the  similarity  measure  should  be. 

Lemma 5.

Assume that $G_A$ is a graph with adjacency matrix A and elements $a_{ij} \geq 0$. Also, let $G_B$
be a graph with adjacency matrix B and elements

\[
b_{ij} =
\begin{cases}
a_{ij} + k & \text{if } (i, j) = (i_0, j_0) \text{ or } (i, j) = (j_0, i_0) \\
a_{ij} & \text{otherwise}
\end{cases}
\]

where k is a positive integer ($k \geq 1$). Then, it holds that $(S_B)_{ij} \geq (S_A)_{ij}$.

[Figure 7.2 panel title: Simulation for edge submodularity (nodes = 100)]
Figure 7.2: Illustration of submodularity: Simulations for different graph sizes indicate that
DeltaCon satisfies the edge-submodularity property. The distance between two graphs that differ only in
the edge $(i_0, j_0)$ is a decreasing function of the number of edges in the original graph $G_t$. Conversely,
their similarity is an increasing function of the number of edges in the original graph $G_t$.

Proof. From Equation 7.1, by expressing the matrix inversion using a power series and ignoring
the terms of greater than second power, we obtain the solution:

\[
\vec{b}_{hi} = [I + (\epsilon A - \epsilon^2 D) + (\epsilon A - \epsilon^2 D)^2 + \dots]\,\vec{e}_i
\;\Rightarrow\; \vec{b}_{hi} \approx [I + \epsilon A + \epsilon^2 (A^2 - D)]\,\vec{e}_i,
\]

or equivalently

\[
S_A = I + \epsilon A + \epsilon^2 A^2 - \epsilon^2 D_A
\]

and

\[
S_B = I + \epsilon B + \epsilon^2 B^2 - \epsilon^2 D_B.
\]

We note that

\[
S_B - S_A = \epsilon (B - A) + \epsilon^2 (B^2 - A^2) - \epsilon^2 (D_B - D_A),
\]

where

\[
(B - A)_{ij} =
\begin{cases}
k & \text{if } (i, j) = (i_0, j_0) \text{ or } (i, j) = (j_0, i_0) \\
0 & \text{otherwise}
\end{cases}
\]

and, since the degree matrices are diagonal,

\[
(D_B - D_A)_{ij} =
\begin{cases}
k & \text{if } (i, j) = (i_0, i_0) \text{ or } (i, j) = (j_0, j_0) \\
0 & \text{otherwise.}
\end{cases}
\]

By applying basic algebra, it can be shown that

\[
(B^2)_{ij}
\begin{cases}
= (A^2)_{ij} + 2k \cdot a_{i_0 j_0} + k^2 & \text{if } (i, j) = (i_0, i_0) \text{ or } (i, j) = (j_0, j_0) \\
\geq (A^2)_{ij} & \text{otherwise}
\end{cases}
\qquad (7.4)
\]

since $a_{ij} \geq 0$ for all i, j by definition. Next, observe that for Equation 7.4, we have 3 cases:

• For all $(i, j) \neq (i_0, i_0)$ and $(i, j) \neq (j_0, j_0)$, we have

\[
(S_B - S_A)_{ij} = \epsilon (B - A)_{ij} + \epsilon^2 (B^2 - A^2)_{ij} \geq 0
\]

• For $(i, j) = (i_0, i_0)$ we have

\[
(S_B - S_A)_{i_0 i_0} = \epsilon^2 (2k \cdot a_{i_0 j_0} + k^2) - \epsilon^2 k = \epsilon^2 k\,(2 a_{i_0 j_0} + k - 1) \geq 0
\]

• For $(i, j) = (j_0, j_0)$ we have

\[
(S_B - S_A)_{j_0 j_0} = \epsilon^2 k\,(2 a_{i_0 j_0} + k - 1) \geq 0
\]

Hence, for all (i, j) it holds that $(S_B)_{ij} \geq (S_A)_{ij}$. ■
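
As a sanity check of the case analysis in Lemma 5, the second-order expression for $S$ can be
evaluated numerically on a small weighted graph. The snippet below is a hedged sketch: the
function names and the example matrix are made up for illustration, and it checks only the
truncated power series used in the proof, not the exact matrix inverse.

```python
import numpy as np


def second_order_affinity(A, eps):
    """Truncated power series S = I + eps*A + eps^2*A^2 - eps^2*D."""
    D = np.diag(A.sum(axis=1))
    return np.eye(A.shape[0]) + eps * A + eps**2 * (A @ A) - eps**2 * D


def lemma5_gap(A, i0, j0, k, eps):
    """Return S_B - S_A, where B equals A with weight k added to edge (i0, j0)."""
    B = A.copy()
    B[i0, j0] += k
    B[j0, i0] += k
    return second_order_affinity(B, eps) - second_order_affinity(A, eps)


# Hypothetical weighted, undirected graph on 4 nodes (a_01 = 2).
A = np.array([[0, 2, 1, 0],
              [2, 0, 0, 3],
              [1, 0, 0, 1],
              [0, 3, 1, 0]], dtype=float)

gap = lemma5_gap(A, i0=0, j0=1, k=2, eps=0.1)
# Diagonal case of the lemma: eps^2 * k * (2*a_01 + k - 1) = 0.01 * 2 * 5.
expected_diagonal = 0.1**2 * 2 * (2 * 2 + 2 - 1)
```

Here every entry of `gap` is non-negative, and the two diagonal entries at $(i_0, i_0)$ and
$(j_0, j_0)$ agree with the closed form derived above.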

Next we will use the lemma to prove the weight awareness property.

Proof. [Property P3 - Weight Awareness] We formalize the weight awareness property in the
following way. Let A be the adjacency matrix of a weighted, undirected graph, with elements $a_{ij}$.
Then, B is equal to A but with a bigger weight for the edge $(i_0, j_0)$, or more formally:

\[
b_{ij} =
\begin{cases}
a_{ij} + k & \text{if } (i, j) = (i_0, j_0) \text{ or } (i, j) = (j_0, i_0) \\
a_{ij} & \text{otherwise}
\end{cases}
\]

Let also C be the adjacency matrix of another graph with the same entries as A except for $c_{i_0 j_0}$,
which is bigger than $b_{i_0 j_0}$:

\[
c_{ij} =
\begin{cases}
a_{ij} + k' & \text{if } (i, j) = (i_0, j_0) \text{ or } (i, j) = (j_0, i_0) \\
a_{ij} & \text{otherwise}
\end{cases}
\]

where $k' > k$ is an integer. To prove the property, it suffices to show that
$sim(A, B) \geq sim(A, C) \Leftrightarrow d(A, B) \leq d(A, C)$. Notice that this formal definition includes the case of
removing an edge by assuming that $a_{i_0 j_0} = 0$ for matrix A.

We can write the difference of the squares of the RootED distances as:

\[
\begin{aligned}
d^2(A, B) - d^2(A, C)
&= \sum_{i=1}^{n} \sum_{j=1}^{n} \left(\sqrt{s_{A,ij}} - \sqrt{s_{B,ij}}\right)^2
 - \sum_{i=1}^{n} \sum_{j=1}^{n} \left(\sqrt{s_{A,ij}} - \sqrt{s_{C,ij}}\right)^2 \\
&= \sum_{i=1}^{n} \sum_{j=1}^{n} \left(\sqrt{s_{C,ij}} - \sqrt{s_{B,ij}}\right)
   \left(2\sqrt{s_{A,ij}} - \sqrt{s_{B,ij}} - \sqrt{s_{C,ij}}\right) \leq 0
\end{aligned}
\]

because $(S_B)_{ij} \geq (S_A)_{ij}$, $(S_C)_{ij} \geq (S_A)_{ij}$, and $(S_C)_{ij} \geq (S_B)_{ij}$ for all i, j by the construction of the
matrices A, B, and C, and Lemma 5. ■


In  Section  7.4,  we  show  experimentally  that  DeltaCon  not  only  satisfies  the  properties,  but 
also  that  other  similarity  and  distance  methods  fail  in  one  or  more  test  cases. 


7.3  DeltaCon-Attr:  Adding  Node  and  Edge  Attribution 

Thus  far,  we  have  broached  the  intuition  and  decisions  behind  developing  a  method  for  calculating 
graph  similarity.  However,  we  argue  that  computing  this  metric  is  only  half  the  battle  in  the  wider 
realm  of  change  detection  and  graph  understanding.  Equally  important  is  finding  out  why  the 
graph  changed  the  way  it  did.  One  way  of  doing  this  is  attributing  the  changes  to  nodes  and/or 
edges. 

Equipped with this information, we can draw conclusions with respect to how certain changes
impact graph connectivity and apply this understanding in a domain-specific context to assign
blame, as well as institute measures to prevent such changes in the future. Additionally, such a
feature can be used to measure changes which have not yet happened in order to find information
about which nodes and/or edges are most important for preserving or destroying connectivity. In
this section, we discuss our extension of the method, called DeltaCon-Attr, which enables
node- and edge-level attribution for this very purpose.

7.3.1  Algorithm  Description 

Node Attribution Our first goal is to find the nodes which are mostly responsible for the
difference between the input graphs. Let the affinity matrices $S'_1$ and $S'_2$ be precomputed. Then,
the steps of our node attribution algorithm (Algorithm 7.6) can be summarized as follows:

Algorithm 7.6 DeltaCon-Attr Node Attribution

INPUT: affinity matrices S'_1, S'_2
       edge files of G_1(V, E_1) and G_2(V, E_2), i.e., A_1 and A_2

for v = 1 → n do
    // If an edge adjacent to the node has changed, the node is responsible:
    if Σ |A_1(v, :) − A_2(v, :)| > 0 then
        w_v = RootED(S'_{1,v}, S'_{2,v})
    end if
end for

// sort rows of vector w on column index 1 (node impact score) by descending value
[w_sorted, w_sortedIndex] = sortRows(w, 1, ‘descend’)

RETURN: [w_sorted, w_sortedIndex]


Step 1 Intuitively, we compute the difference between the affinity of node v to the node groups
in graph $A_1$ and the affinity of node v to the node groups in graph $A_2$. To that end, we use the same
distance, RootED, that we applied to find the similarity between the whole graphs.

Given that the vth row vector ($v \leq n$) of $S'_1$ and $S'_2$ reflects the affinity of node v to the
remainder of the graph, the RootED distance between the two vectors provides a measure of the
extent to which that node is a culprit for change; we refer to this measure as the impact of a
node. Thus, culprits with comparatively high impact are the ones that are the most responsible for
change between graphs.

More formally, we quantify the contribution of each node to the graph changes by taking
the RootED distance between each corresponding pair of row vectors in $S'_1$ and $S'_2$ as $w_v$ for
$v = 1, \dots, n$ per Equation 7.5.

\[
w_v = \mathrm{RootED}(S'_{1,v}, S'_{2,v}) = \sqrt{\sum_{j=1}^{g} \left(\sqrt{s'_{1,vj}} - \sqrt{s'_{2,vj}}\right)^2}
\qquad (7.5)
\]

Step 2 We sort the scores in the n x 1 node impact vector w in descending order and report the
most important scores and their corresponding nodes.

By default, we report culprits responsible for the top 80% of changes using a similar line of
reasoning as that behind Fukunaga’s heuristic [Fuk9 ]. In practice, we find that the notion of a
skewed impact distribution holds (though the law of factor sparsity holds, the distribution need
not be 80-20).
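
The two node attribution steps can be sketched as follows. This is a minimal sketch rather than
the Matlab implementation: `node_attribution` and `top_culprits` are hypothetical names, the
affinity matrices are assumed to be given as n x g arrays, nodes are 0-based, and the "top 80%"
rule is interpreted (as an assumption) as the shortest ranked prefix whose impact scores sum to
at least 80% of the total impact.

```python
import numpy as np


def node_attribution(S1, S2, A1, A2):
    """Rank nodes with at least one changed incident edge by RootED row distance."""
    changed = np.abs(A1 - A2).sum(axis=1) > 0          # rows with a changed edge
    r1 = np.sqrt(np.clip(S1, 0.0, None))
    r2 = np.sqrt(np.clip(S2, 0.0, None))
    w = np.sqrt(((r1 - r2) ** 2).sum(axis=1))          # impact w_v (Equation 7.5)
    order = np.argsort(-w, kind="stable")              # descending impact
    return [(int(v), float(w[v])) for v in order if changed[v]]


def top_culprits(ranked, share=0.8):
    """Shortest prefix of `ranked` whose scores cover `share` of the total impact."""
    total = sum(score for _, score in ranked)
    picked, running = [], 0.0
    for node, score in ranked:
        if running >= share * total:
            break
        picked.append(node)
        running += score
    return picked


# Toy example: 3 nodes, g = 2 groups; edge (0, 1) is removed in the second graph.
A1 = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
A2 = np.array([[0, 0, 0], [0, 0, 1], [0, 1, 0]], dtype=float)
S1 = np.array([[1.00, 0.25], [0.81, 0.49], [0.04, 1.00]])
S2 = np.array([[1.00, 0.00], [0.64, 0.49], [0.09, 1.00]])

ranked = node_attribution(S1, S2, A1, A2)
culprits = top_culprits(ranked)
```

In this toy input node 2 is not reported even though its affinity row differs, because none of
its incident edges changed, which mirrors the `if` test in Algorithm 7.6.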

Edge Attribution Complementary to the node attribution approach, we have also developed
an edge attribution method which ranks edge changes (additions and deletions) with respect to
the graph changes. The steps of our edge attribution algorithm (Algorithm 7.7) are:

Step 1 We assign each changed edge incident to at least one node in the culprit set an impact
score. This score is equal to the sum of impact scores for the nodes that the edge connects or
disconnects.

Our goal here is to assign edge impact according to the degree that they affect the nodes
they touch. Since even the removal or addition of a single edge does not necessarily impact both
incident nodes equally, we choose the sum of both nodes’ scores as the edge impact metric. Thus,
our algorithm will rank edges which touch two nodes of moderately high impact higher
than edges which touch one node of high impact but another of low impact.

Algorithm 7.7 DeltaCon-Attr Edge Attribution

INPUT: adjacency matrices A_1, A_2,
       culprit set of interest w_sortedIndex,1...index and
       node impact scores w

for v = 1 → length(w_sortedIndex,1...index) do
    srcNode = w_sortedIndex,v
    r = A_{2,srcNode} − A_{1,srcNode}    // row of changes for the culprit node
    for k = 1 → n do
        destNode = k
        if r_k = 1 then
            edgeScore = w_srcNode + w_destNode
            append row [srcNode, destNode, ‘+’, edgeScore] to E
        end if
        if r_k = −1 then
            edgeScore = w_srcNode + w_destNode
            append row [srcNode, destNode, ‘−’, edgeScore] to E
        end if
    end for
end for

// sort rows of matrix E on column index 4 (edge impact score) by descending value
E_sorted = sortrows(E, 4, ‘descend’)

RETURN: E_sorted

Step  2  We  sort  the  edge  impact  scores  in  descending  order  and  report  the  edges  in  order  of 
importance. 

Analysis  of  changed  edges  can  reveal  important  discrepancies  from  baseline  behavior.  Specifically, 
a  large  number  of  added  edges  or  removed  edges  with  individually  low  impact  is  indicative  of  star 
formation  or  destruction,  whereas  one  or  a  few  added  or  removed  edges  with  individually  high 
impact  are  indicative  of  community  expansion  or  reduction  via  addition  or  removal  of  certain  key 
bridge  edges. 
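
A direct transcription of the edge attribution steps into Python might look as follows. It is a
sketch under stated assumptions: `edge_attribution` is a hypothetical name, adjacency matrices are
plain nested lists with 0/1 entries, nodes are 0-based, and, as in the pseudocode, an edge whose
two endpoints are both in the culprit set is listed once per endpoint.

```python
def edge_attribution(A1, A2, culprits, w):
    """Score changed edges incident to culprit nodes by the sum of node impacts."""
    n = len(A1)
    E = []
    for src in culprits:
        for dst in range(n):
            r = A2[src][dst] - A1[src][dst]      # +1: edge added, -1: edge removed
            if r == 1:
                E.append((src, dst, "+", w[src] + w[dst]))
            elif r == -1:
                E.append((src, dst, "-", w[src] + w[dst]))
    # Sort by edge impact score (fourth field), highest first.
    E.sort(key=lambda row: row[3], reverse=True)
    return E


# Toy example: edge (0, 1) is removed and edge (0, 2) is added.
A1 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
A2 = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]
w = [0.5, 0.25, 0.125]                           # hypothetical node impact scores
ranked_edges = edge_attribution(A1, A2, culprits=[0], w=w)
```

With these inputs the removed edge (0, 1) outranks the added edge (0, 2), because node 1 has a
higher impact score than node 2.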

7.3.2  Scalability  Analysis 

Given precomputed $S'_1$ and $S'_2$ (precomputation is assumed since attribution can only be conducted
after similarity computation), the node attribution component of DeltaCon-Attr is loglinear in
the number of nodes, since n influence scores need to be sorted in descending order. In more
detail, the cost of computing the impact scores for nodes is linear in the number of nodes and
groups, but is dominated by the sorting cost given that $g \ll \log(n)$ in general.

With the same assumptions with respect to precomputed results, the edge attribution portion
of DeltaCon-Attr is also loglinear, but in the sum of edge counts, since $m_1 + m_2$ total possible
changed edges need to be sorted. In practice, the number of edges needing to be sorted should
be far smaller, as we only need to concern ourselves with edges which are incident to nodes in the
culprit set of interest. Specifically, the cost of computing impact scores for edges is linear in the
number of nodes in the culprit set k and the number of changed edges, but is again dominated by
the sorting cost given that $k \ll \log(m_1 + m_2)$ in general.


7.4  Experiments 

We  conduct  several  experiments  on  synthetic  (Figure  7.3),  as  well  as  real  data  (Table  7.6  with 
undirected,  unweighted  graphs,  unless  stated  otherwise)  to  answer  the  following  questions: 

Q1. Does DeltaCon agree with our intuition and satisfy the axioms/properties? Where do
other methods fail?

Q2.  Is  DeltaCon  scalable  and  able  to  compare  large-scale  graphs? 

Q3.  How  sensitive  is  it  to  the  number  of  node  groups? 

We implemented the code in Matlab and ran the experiments on AMD Opteron Processor 854
@3GHz, RAM 32GB. The selection of parameters in Equation 7.1 follows the lines of Chapter 4: all
the parameters are chosen so that the system converges.

7.4.1 Intuitiveness of DeltaCon

To answer Q1, for the first 3 properties (P1-P3), we conduct experiments on small graphs of 5 to
100 nodes and classic topologies (cliques, stars, circles, paths, barbell and wheel-barbell graphs,
and “lollipops” shown in Figure 7.3), since people can argue about their similarities. For the name
conventions, see Table 7.2. For our method we used five groups (g = 5), but the results are similar
for other choices of the parameter. In addition to synthetic graphs, for informal property (IP), we
use real networks with up to 11 million edges (Table 7.6).

We  compare  our  method,  DeltaCon,  to  the  six  best  state-of-the-art  similarity  measures  that 
apply  to  our  setting: 

1. Vertex/Edge Overlap (VEO): In [PDGM08], the VEO similarity between two graphs $G_1(V_1, E_1)$
and $G_2(V_2, E_2)$ is defined as:

\[
sim_{VEO}(G_1, G_2) = 2\,\frac{|E_1 \cap E_2| + |V_1 \cap V_2|}{|E_1| + |E_2| + |V_1| + |V_2|}.
\]

2. Graph Edit Distance (GED): GED has quadratic complexity in general, but it is linear in the
number of nodes and edges when only insertions and deletions are allowed [ ].

\[
sim_{GED}(G_1, G_2) = |V_1| + |V_2| - 2|V_1 \cap V_2| + |E_1| + |E_2| - 2|E_1 \cap E_2|.
\]

For $V_1 = V_2$ and unweighted graphs, $sim_{GED}$ is equivalent to the Hamming distance (HD)
defined as $HD(A_1, A_2) = sum(A_1 \text{ XOR } A_2)$.

3. Signature Similarity (SS): This is the best performing similarity measure studied in [PDGM08].
It starts from node and edge features, and by applying the SimHash algorithm (random
projection based method), projects the features to a small dimension feature space, which is
called signature. The similarity between the graphs is defined as the similarity between their
signatures.

4. The last 3 methods are variations of the well-studied spectral method “λ-distance” ([ )KW06],
[ :a03], [ £08]). Let $\{\lambda_{1i}\}_{i=1}^{|V_1|}$ and $\{\lambda_{2i}\}_{i=1}^{|V_2|}$ be the eigenvalues of the matrices that
represent $G_1$ and $G_2$. Then, λ-distance is given by

\[
d_{\lambda}(G_1, G_2) = \sqrt{\sum_{i=1}^{k} (\lambda_{1i} - \lambda_{2i})^2},
\]

where k is $\max(|V_1|, |V_2|)$ (padding is required for the smallest vector of eigenvalues). The
variations of the method are based on three different matrix representations of the graphs:
adjacency (λ-D Adj.), laplacian (λ-D Lap.) and normalized laplacian matrix (λ-D N.L.).

[Figure 7.3 panels, as extracted: C5, mC5/P5, m2C5/mP5, mmB10, w5B10, S5, mS5, m2WhB12,
mmWhB12, mm2WhB12]
Figure 7.3: Small, synthetic graphs used in the DeltaCon experimental analysis - K: clique, C:
cycle, P: path, S: star, B: barbell, L: lollipop, and WhB: wheel-barbell. See Table 7.2 for the name
conventions of the synthetic graphs.
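
To make the first two baselines concrete, the VEO and GED formulas above can be evaluated
directly on node and edge sets. The snippet is an illustrative sketch: the function names are
hypothetical, graphs are assumed undirected, and each edge is represented as a `frozenset` of its
two endpoints so that (u, v) and (v, u) coincide.

```python
def veo_similarity(V1, E1, V2, E2):
    """Vertex/Edge Overlap: 2 (|E1 ∩ E2| + |V1 ∩ V2|) / (|E1| + |E2| + |V1| + |V2|)."""
    shared = len(E1 & E2) + len(V1 & V2)
    return 2.0 * shared / (len(E1) + len(E2) + len(V1) + len(V2))


def ged_insert_delete(V1, E1, V2, E2):
    """Graph edit distance when only insertions and deletions are allowed."""
    return (len(V1) + len(V2) - 2 * len(V1 & V2)
            + len(E1) + len(E2) - 2 * len(E1 & E2))


def edge_set(pairs):
    """Undirected edges as frozensets, so orientation does not matter."""
    return {frozenset(p) for p in pairs}


# Path on 3 nodes versus triangle on the same nodes (one extra edge).
V = {0, 1, 2}
path = edge_set([(0, 1), (1, 2)])
triangle = edge_set([(0, 1), (1, 2), (0, 2)])

veo = veo_similarity(V, path, V, triangle)      # 2 * (2 + 3) / (2 + 3 + 3 + 3)
ged = ged_insert_delete(V, path, V, triangle)   # one inserted edge
```

Both measures depend only on how many nodes and edges overlap, not on which edge changed, which
is why they cannot distinguish a "bridge" edge from any other edge.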

Table 7.2: Name conventions for synthetic graphs. Missing number after the prefix implies X = 1.

Symbol    Meaning
K_n       clique of size n
P_n       path of size n
C_n       cycle of size n
S_n       star of size n
L_n       lollipop of size n
B_n       barbell of size n
WhB_n     wheel barbell of size n
mX        missing X edges
mmX       missing X “bridge” edges
w         weight of “bridge” edge

The results for the first three properties are presented in the form of Tables 7.3-7.5. For
property P1 we compare the graphs (A,B) and (A,C) and report the difference between the
pairwise similarities/distances of our proposed methods and the 6 state-of-the-art methods. We
have arranged the pairs of graphs in such a way that (A,B) are more similar than (A,C). Therefore,
table entries that are non-positive mean that the corresponding method does not satisfy the
property. Similarly, for properties P2 and P3, we compare the graphs (A,B) and (C,D) and report
the difference in their pairwise similarity/distance scores.

P1. Edge Importance “Edges whose removal creates disconnected components are more
important than other edges whose absence does not affect the graph connectivity. The more important an
edge is, the more it should affect the similarity or distance measure.”

For  this  experiment  we  use  the  barbell,  “wheel  barbell”  and  “lollipop”  graphs  depicted  in 
Figure  7.3,  since  it  is  easy  to  argue  about  the  importance  of  the  individual  edges.  The  idea  is 
that  edges  in  a  highly  connected  component  (e.g.,  clique,  wheel)  are  not  very  important  from  the 
information  flow  viewpoint,  while  edges  that  connect  (almost  uniquely)  dense  components  play  a 
significant  role  in  the  connectivity  of  the  graph  and  the  information  flow.  The  importance  of  the 
“bridge”  edge  depends  on  the  size  of  the  components  that  it  connects;  the  bigger  the  components, 
the  more  important  the  role  of  the  edge  is. 

Observation 11. Only DeltaCon succeeds in distinguishing the importance of the edges (P1)
with respect to connectivity, while all the other methods fail at least once (Table 7.3).
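
The edge importance property can be illustrated on a small barbell graph. The following is a
sketch, not the experimental code behind Table 7.3: helper names are hypothetical, the affinity
matrix is obtained by exact inversion of $I + \epsilon^2 D - \epsilon A$, the similarity is taken as $1 / (1 + d)$ with
$d$ the RootED distance, and a single $\epsilon = 1/(1 + \text{max degree of } A)$ is assumed for all three graphs.

```python
import numpy as np


def affinity(A, eps):
    """Affinity matrix S = (I + eps^2 D - eps A)^(-1)."""
    D = np.diag(A.sum(axis=1))
    return np.linalg.inv(np.eye(A.shape[0]) + eps**2 * D - eps * A)


def rooted(S1, S2):
    """RootED distance between two affinity matrices."""
    r1 = np.sqrt(np.clip(S1, 0.0, None))   # clip removes round-off negatives
    r2 = np.sqrt(np.clip(S2, 0.0, None))
    return float(np.sqrt(((r1 - r2) ** 2).sum()))


def similarity(A1, A2, eps):
    """Similarity in (0, 1]: 1 / (1 + RootED distance)."""
    return 1.0 / (1.0 + rooted(affinity(A1, eps), affinity(A2, eps)))


def barbell(m):
    """Two cliques of m nodes each, joined by the bridge edge (m - 1, m)."""
    A = np.zeros((2 * m, 2 * m))
    A[:m, :m] = 1.0
    A[m:, m:] = 1.0
    np.fill_diagonal(A, 0.0)
    A[m - 1, m] = A[m, m - 1] = 1.0
    return A


def without_edge(A, i, j):
    B = A.copy()
    B[i, j] = B[j, i] = 0.0
    return B


A = barbell(5)
eps = 1.0 / (1.0 + A.sum(axis=1).max())    # assumption: shared by all graphs
B = without_edge(A, 0, 1)                  # an edge inside one clique
C = without_edge(A, 4, 5)                  # the "bridge" edge

sim_AB = similarity(A, B, eps)
sim_AC = similarity(A, C, eps)
```

In this setting removing the bridge disconnects the two cliques, so `sim_AC` comes out lower
than `sim_AB`, in line with P1.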


P2. “Edge-Submodularity” “Let $A(V, E_1)$ and $B(V, E_2)$ be two unweighted graphs with the
same node set, and $|E_1| > |E_2|$ edges. Also, assume that $m_xA(V, E_1)$ and $m_xB(V, E_2)$ are the respective
derived graphs after removing x edges. We expect that $sim(A, m_xA) > sim(B, m_xB)$, since the fewer
the edges in a constant-sized graph, the more “important” they are.”

The results for different graph topologies and 1 or 10 removed edges (prefixes ’m’ and ’m10’
respectively) are given compactly in Table 7.4. Recall that non-positive values denote violation of
the “edge-submodularity” property.

Observation 12. Only DeltaCon complies with the “edge-submodularity” property (P2) in all
cases examined.


P3.  Weight  Awareness  “The  absence  of  an  edge  of  big  weight  is  more  important  than  the  absence 
of  a  smaller  weighted  edge;  this  should  be  reflected  in  the  similarity  measure.  ” 

The  weight  of  an  edge  defines  the  strength  of  the  connection  between  two  nodes,  and,  in  this 
sense,  can  be  viewed  as  a  feature  that  relates  to  the  importance  of  the  edge  in  the  graph.  For  this 
property,  we  study  the  weighted  versions  of  the  barbell  graph,  where  we  assume  that  all  the  edges 
except  the  “bridge”  have  unit  weight. 

Observation  13.  All  the  methods  are  weight-aware  (P3),  except  VEO  and  GED,  which  compute 
just  the  overlap  in  edges  and  vertices  between  the  graphs  (Table  7.5). 




Table 7.3: “Edge Importance” (P1). Non-positive entries violate P1.
Entries: Δs = sim(A,B) - sim(A,C) for similarity methods; Δd = d(A,C) - d(A,B) for distance methods.

A      B        C          DC0    DC     VEO    SS       GED(XOR)  λ-d Adj.  λ-d Lap.  λ-d N.L.
B10    mB10     mmB10      0.07   0.04   0      -10^-5   0         0.21      -0.27     2.14
L10    mL10     mmL10      0.04   0.02   0      10^-5    0         -0.30     -0.43     -8.23
WhB10  mWhB10   mmWhB10    0.03   0.01   0      -10^-5   0         0.22      0.18      -0.41
WhB10  m2WhB10  mm2WhB10   0.07   0.04   0      -10^-5   0         0.59      0.41      0.87

Table 7.4: “Edge-Submodularity” (P2). Non-positive entries violate P2.
Entries: Δs = sim(A,B) - sim(C,D) for similarity methods; Δd = d(C,D) - d(A,B) for distance methods.

A     B        C     D          DC0    DC     VEO    SS       GED(XOR)  λ-d Adj.  λ-d Lap.  λ-d N.L.
K5    mK5      C5    mC5        0.03   0.03   0.02   10^-5    0         -0.24     -0.59     -7.77
C5    mC5      P5    mP5        0.03   0.01   0.01   -10^-5   0         -0.55     -0.39     -0.20
K100  mK100    C100  mC100      0.03   0.02   0.002  10^-5    0         -1.16     -1.69     -311
C100  mC100    P100  mP100      10^-4  0.01   10^-5  -10^-5   0         -0.08     -0.06     -0.08
K100  m10K100  C100  m10C100    0.10   0.08   0.02   10^-5    0         -3.48     -4.52     -1089
C100  m10C100  P100  m10P100    0.001  0.001  10^-5  0        0         -0.03     0.01      0.31

Table 7.5: “Weight Awareness” (P3). Non-positive entries violate P3.
Entries: Δs = sim(A,B) - sim(C,D) for similarity methods; Δd = d(C,D) - d(A,B) for distance methods.

A      B      C      D        DC0    DC     VEO    SS       GED(XOR)  λ-d Adj.  λ-d Lap.  λ-d N.L.
B10    mB10   B10    w5B10    0.09   0.08   -0.02  10^-5    -1        3.67      5.61      84.44
mmB10  B10    mmB10  w5B10    0.10   0.10   0      10^-4    0         4.57      7.60      95.61
B10    mB10   w5B10  w2B10    0.06   0.06   -0.02  10^-5    -1        2.55      3.77      66.71
w5B10  w2B10  w5B10  mmB10    0.10   0.07   0.02   10^-5    1         2.23      3.55      31.04
w5B10  w2B10  w5B10  B10      0.03   0.02   0      10^-5    0         1.12      1.84      17.73

Tables 7.3-7.5: DeltaCon0 and DeltaCon (in bold) obey all the formal required properties (P1-P3).
Each row of the tables corresponds to a comparison between the similarities (or distances) of two
pairs of graphs; pairs (A,B) and (A,C) for property (P1); and pairs (A,B) and (C,D) for (P2) and
(P3). Non-positive values of Δs = sim(A,B) - sim(C,D) and Δd = d(C,D) - d(A,B) for similarity
and distance methods, respectively, are highlighted and mean violation of the property of interest.


IP. Focus Awareness At this point, all the competing methods have failed in satisfying at least
one of the formal desired properties. To test whether DeltaCon satisfies our informal property, i.e.,
it is able to distinguish the extent of a change in a graph, we analyze real datasets with up to 11
million edges (Table 7.6) for two different types of changes. For each graph we create corrupted
instances by removing: (i) edges from the original graph randomly, and (ii) the same number of
edges in a targeted way (we randomly choose nodes and remove all their edges, until we have
removed the appropriate fraction of edges).

Table 7.6: Large Real and Synthetic Datasets

Name                          Nodes        Edges                    Description
Brain Graphs Small [OCP14]    70           800-1,208                connectome
Enron Email [KY04]            36,692       367,662                  who-emails-whom
Facebook wall [VMCG09]        45,813       183,412                  who-posts-to-whom
Facebook links [VMCG09]       63,731       817,090                  friend-to-friend
Epinions [GKRT04]             131,828      841,372                  who-trusts-whom
Email EU [LKF07]              265,214      420,045                  who-sent-to-whom
Web Notre Dame [SNA]          325,729      1,497,134                site-to-site
Web Stanford [SNA]            281,903      2,312,497                site-to-site
Web Google [SNA]              875,714      5,105,039                site-to-site
Web Berk/Stan [SNA]           685,230      7,600,595                site-to-site
AS Skitter [LKF07]            1,696,415    11,095,298               p2p links
Brain Graphs Big [OCP14]      16,777,216   49,361,130-90,492,237    connectome
Kronecker 1                   6,561        65,536                   synthetic
Kronecker 2                   19,683       262,144                  synthetic
Kronecker 3                   59,049       1,048,576                synthetic
Kronecker 4                   177,147      4,194,304                synthetic
Kronecker 5                   531,441      16,777,216               synthetic
Kronecker 6                   1,594,323    67,108,864               synthetic


For  this  property,  we  study  8  real  networks:  Email  EU  and  Enron  Emails,  Facebook  wall  and 
Facebook  links,  Google  and  Stanford  web,  Berkeley/Stanford  web  and  AS  Skitter.  In  Figure  7.4, 
we  give  the  DeltaCon  similarity  score  between  the  original  graph  and  the  corrupted  graph  with 
up  to  30%  removed  edges.  For  each  graph,  we  perform  the  experiment  for  random  (solid  line) 
and  targeted  (dashed  line)  edge  removal. 

Observation  14. 

•  “Targeted  changes  hurt  more.”  DeltaCon  is  focus-aware  (IP).  Removal  of  edges  in  a 
targeted  way  leads  to  smaller  similarity  of  the  derived  graph  to  the  original  one  than 
removal  of  the  same  number  of  edges  in  a  random  way. 

•  “More changes: random ≈ targeted.” In Figure 7.4, as the fraction of removed edges
increases, the similarity score for random changes (solid lines) tends to the similarity score
for targeted changes (dashed lines).

This  is  expected,  because  the  random  and  targeted  edge  removal  tend  to  be  equivalent  when  a 
significant  fraction  of  edges  is  deleted. 
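
The two corruption procedures used for this property can be sketched as below. This is an
illustrative sketch only: the function names and the toy graph are hypothetical, edges are
stored as sorted tuples, and the targeted variant stops in the middle of a node's edge list once
the requested number of edges has been removed (an assumption made so that both variants remove
exactly the same number of edges).

```python
import random


def remove_random(edges, fraction, rng):
    """Remove a uniformly random `fraction` of the edges."""
    k = int(round(fraction * len(edges)))
    removed = set(rng.sample(sorted(edges), k))
    return edges - removed


def remove_targeted(edges, fraction, rng):
    """Pick nodes at random and strip their edges until `fraction` is removed."""
    k = int(round(fraction * len(edges)))
    remaining = set(edges)
    nodes = sorted({u for edge in edges for u in edge})
    rng.shuffle(nodes)
    removed = 0
    for v in nodes:
        for edge in sorted(e for e in remaining if v in e):
            if removed == k:
                return remaining
            remaining.discard(edge)
            removed += 1
    return remaining


# Toy graph: a ring on 20 nodes plus 10 chords, 30 edges in total.
ring = {(i, i + 1) for i in range(19)} | {(0, 19)}
chords = {(i, i + 5) for i in range(10)}
edges = ring | chords

rng = random.Random(0)
random_version = remove_random(edges, 0.3, rng)
targeted_version = remove_targeted(edges, 0.3, rng)
```

Both corrupted graphs keep the same number of edges; only the placement of the removed edges
differs, which is exactly the contrast that the focus-awareness experiment measures.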

In Figure 7.4(e)-(f), we give the similarity score as a function of the percent of the removed
edges. Specifically, the x axis corresponds to the percentage of edges that have been removed
from the original graph, and the y axis gives the similarity score. As before, each point maps to
the similarity score between the original graph and the corresponding corrupted graph.

[Figure 7.4 panels: (a) Email Networks; (b) Facebook Networks; (c) Web Stanford and Google;
(d) Web Berkeley/Stanford and AS Skitter; (e) Social networks; (f) Web and p2p graphs.
x axis: fraction of removed edges (2%, 10%, 20%, 30%).]

Figure 7.4: DeltaCon is “focus-aware” (IP): Targeted changes hurt more than random ones. (a)-(f):
DeltaCon similarity scores for random (solid lines) and targeted (dashed lines) changes vs. the
fraction of removed edges in the “corrupted” versions of the original graphs (x axis). We note that
the dashed lines are always below the solid lines of the same color. (e)-(f): DeltaCon agrees with
intuition: the more a graph changes (i.e., the number of removed edges increases), the smaller is
its similarity to the original graph.




Observation  15.  “More  changes  hurt  more.”  The  higher  the  corruption  of  the  original  graph,  the 
smaller  the  DeltaCon  similarity  between  the  derived  and  the  original  graph  is.  In  Figure  7.4,  we 
observe  that  as  the  percentage  of  removed  edges  increases,  the  similarity  to  the  original  graph 
decreases  consistently  for  a  variety  of  real  graphs. 

General Remarks. All in all, the baseline methods have several undesirable properties. The
spectral methods, as well as SS, fail to comply with the “edge importance” (P1) and “edge
submodularity” (P2) properties. Moreover, λ-distance has high computational cost when the whole
graph spectrum is computed, cannot distinguish the differences between co-spectral graphs, and
sometimes small changes lead to big differences in the graph spectra. VEO and GED are
oblivious to significant structural properties of the graphs; thus, despite their straightforwardness
and fast computation, they fail to discern various changes in the graphs. On the other hand,
DeltaCon gives tangible similarity scores and conforms to all the desired properties.

7.4.2  Intuitiveness  of  DeltaCon-Attr 

In addition to evaluating the intuitiveness of DeltaCon₀ and DeltaCon, we also test
DeltaCon-Attr on a number of synthetically created and modified graphs, and compare it to the
state-of-the-art methods. We perform two types of experiments. The first experiment examines
whether the ranking of the culprit nodes by our method agrees with intuition. In the second
experiment, we evaluate DeltaCon-Attr’s classification accuracy in finding culprits, and compare
it to the best-performing competitive approach, CAD⁵ [SD1 ], which was introduced concurrently
with, and independently of, our work. CAD uses the idea of commute time between nodes to define the
anomalousness of nodes/edges. In a random walk, the commute time between nodes i and j is defined
as the expected number of steps starting at i, before node j is visited and then node i is reached again. We give a
qualitative comparison between DeltaCon-Attr and CAD in Section 8.5 (Node/Edge Attribution).
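As a rough illustration of the commute time that CAD builds on, the following sketch computes it for a small unweighted, connected graph directly from the definition above: the expected number of steps from i to j plus the expected number of steps back from j to i. The function names and the adjacency-list input are assumptions made for this illustration and are not part of CAD or DeltaCon-Attr; the linear system is solved with plain Gaussian elimination, which is only practical for toy graphs.

```python
def hitting_times(adj, target):
    """Expected number of random-walk steps to first reach `target` from each node.

    adj: dict mapping node -> list of neighbours (undirected, connected, no self-loops).
    Solves h[k] = 1 + average of h[u] over neighbours u of k, with h[target] = 0.
    """
    nodes = [v for v in adj if v != target]
    idx = {v: r for r, v in enumerate(nodes)}
    n = len(nodes)
    # Augmented matrix of the system (I - P) h = 1 restricted to non-target nodes.
    M = [[0.0] * (n + 1) for _ in range(n)]
    for v in nodes:
        r = idx[v]
        M[r][r] = 1.0
        M[r][n] = 1.0
        for u in adj[v]:
            if u != target:
                M[r][idx[u]] -= 1.0 / len(adj[v])
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                for k in range(c, n + 1):
                    M[r][k] -= f * M[c][k]
    h = {v: M[idx[v]][n] / M[idx[v]][idx[v]] for v in nodes}
    h[target] = 0.0
    return h


def commute_time(adj, i, j):
    """Expected steps to go from i to j and then back to i."""
    return hitting_times(adj, j)[i] + hitting_times(adj, i)[j]


# Toy example: the path graph 0 - 1 - 2.
path = {0: [1], 1: [0, 2], 2: [1]}
print(commute_time(path, 0, 2))
```

On this path graph the walk needs 4 expected steps in each direction between the two end nodes, so their commute time is 8.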

Ranking Accuracy We extensively tested DeltaCon-Attr on a number of synthetically created
and modified graphs, and compared it with CAD. We note that CAD was designed to simply identify
culprits in time-evolving graphs without ranking them. In order to compare it with our method,
we adapted CAD so that it returns ranked lists of node and edge culprits: (i) We rank the culprit
edges in decreasing order of edge score ΔE; (ii) To each node v, we attach a score equal to the
sum of the scores of its adjacent edges, i.e., $\sum_{u \in N(v)} \Delta E((v,u))$, where N(v) are the neighbors of
v. Subsequently, we rank the nodes in decreasing order of attached score.
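The adaptation of CAD described above can be sketched as follows. This is a minimal illustration that assumes the per-edge scores are already available as a dictionary; the function name and the example scores are hypothetical and do not come from the CAD implementation.

```python
from collections import defaultdict


def rank_culprits(edge_scores):
    """Rank edges by score, and nodes by the sum of the scores of their adjacent edges.

    edge_scores: dict mapping an undirected edge (u, v) to its anomaly score.
    Returns (ranked_edges, ranked_nodes), both in decreasing order of score.
    """
    ranked_edges = sorted(edge_scores, key=edge_scores.get, reverse=True)
    node_scores = defaultdict(float)
    for (u, v), score in edge_scores.items():
        node_scores[u] += score
        node_scores[v] += score
    ranked_nodes = sorted(node_scores, key=node_scores.get, reverse=True)
    return ranked_edges, ranked_nodes


# Hypothetical scores for two edges that share node 3.
edges, nodes = rank_culprits({(3, 4): 2.0, (3, 5): 1.0})
print(edges, nodes)
```

Node 3 accumulates the scores of both edges and therefore ranks first among the nodes, mirroring step (ii).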

We  give  several  of  the  conducted  experiments  in  Table  7.7,  and  the  corresponding  graphs  in 
Figures  7.5  and  7.6.  Each  row  of  the  table  corresponds  to  a  comparison  between  graph  A  and 
graph  B.  The  node  and  edge  culprits  that  explain  the  main  differences  between  the  compared 
graphs are annotated in Figures 7.5 and 7.6. The darker a node is, the higher it is in the ranked list
of node culprits. Similarly, edges that are adjacent to darker nodes are higher in the ranked list of
edge culprits than edges that are adjacent to lighter nodes. If the returned ranked list agrees with
the expected list (according to the formal and informal properties), we characterize the attribution
of the method as correct (checkmark). If there is disagreement, we provide the ordered list that the
method returned. If two nodes or edges are tied, we use “=”. For CAD we picked the parameter
δ such that the algorithm returns 5 culprit edges and their adjacent nodes. Thus, we flag the
returned list if CAD outputs 5 culprits while more exist. For each comparison, we also
give the properties (last column) that define the order of the culprit edges and nodes.

⁵ CAD was originally introduced for finding culprit nodes and edges without ranking them. We extended
the proposed method to rank the culprits.

Observation 16. DeltaCon-Attr reflects the desired properties (P1, P2, P3, and IP), while
CAD fails to return the expected ranked lists of node and edge culprits in several cases.

Next  we  explain  some  of  the  comparisons  that  we  present  in  Table  7.7: 

•  K5-mK5:  The  pair  consists  of  a  5-node  complete  graph  and  the  same  graph  with  one  missing 
edge,  (3,4).  DeltaCon-Attr  considers  nodes  3  and  4  top  culprits,  with  equal  rank,  due  to 
equivalent  loss  in  connectivity.  Edge  (3,4)  is  ranked  top,  and  is  essentially  the  only  changed 
edge.  CAD  finds  the  same  results. 

• K5-m2K5: The pair consists of a 5-node complete graph and the same graph with two
missing edges, (3,4) and (3,5). Both DeltaCon-Attr and CAD consider node 3 the top
culprit, because two of its adjacent edges were removed. Node 3 is followed by 4 and 5,
which are tied since they are both missing one adjacent edge (Property IP). The removed
edges, (3,4) and (3,5), are considered equally responsible for the difference between the
two input graphs. We observe similar behavior in larger complete graphs with 100 nodes
(K100, and modified graphs mK100, w5K100, etc.). In the case of K100 and m10K100⁶, CAD
does not return all 13 node culprits and 10 culprit edges because its parameter, δ, was set so
that it would return at most 5 culprit edges⁷.

Table 7.7: DeltaCon-Attr obeys all the required properties. Each row corresponds to a comparison
between graph A and graph B, and evaluates the node and edge attribution of DeltaCon-Attr and
CAD. The right order of edges and nodes is marked in Figures 7.5 and 7.6. We give the ranking of a
method if it is different from the expected one.

⁶ m10K100 is a complete graph of 100 nodes where we have removed 10 edges: (i) 6 of the edges were
adjacent to node 80: (80,82), (80,84), (80,86), (80,88), (80,90), (80,92); (ii) 3 of the edges were adjacent
to node 30: (30,50), (30,60), (30,70); and (iii) edge (1,4).

⁷ The input graphs are symmetric. If edge (a, b) is considered a culprit, CAD returns both (a, b) and (b, a).


[Figure 7.5 panels: K100, mK100, w5K100, m3K100; P100, w2P100, w5P100.]

Figure 7.5: DeltaCon-Attr respects properties P1-P3, and IP. Nodes marked green are identified as
the culprits for the change between the graphs. Darker shade corresponds to higher rank in the list
of culprits. Removed and weighted edges are marked red and green, respectively.

Figure 7.6: [continued] DeltaCon-Attr respects properties P1-P3, and IP. Nodes marked green are
identified as the culprits for the change between the graphs. Darker shade corresponds to higher
rank in the list of culprits. Removed edges are marked red.

• B10-mB10 ∪ mmB10: We compare a barbell graph of 10 nodes to the same graph that
is missing both an edge from a clique, (6,7), and the bridge edge, (5,6). As expected,
DeltaCon-Attr finds 6, 5 and 7 as top culprits, where 6 is ranked higher than 5, since 6 lost
connectivity to both nodes 5 and 7, whereas 5 disconnected only from 6. Node 5 is ranked
higher than 7 because the removal of the bridge edge is more important than the removal
of (6,7) within the second clique (Property P1). CAD returns the same results. We observe
similar results in the case of the larger barbell graphs (B200, mmB200, w20B200, m3B200).

• L10-mL10 ∪ mmL10: This pair of graphs corresponds to the lollipop graph, L10, and the
lollipop variant, mL10 ∪ mmL10, that is missing one edge from the clique, as well as a
bridge edge. Nodes 6, 5 and 4 are considered the top culprits for the difference in the graphs.
Moreover, 6 is ranked more responsible for the change than 5, since 6 lost connectivity to
a more strongly connected component than 5 (Property P2). However, CAD ranks node
5 higher than node 6 despite the difference in the connectivity of the two components
(violation of P2).

• S5, mS5: We compare a 5-node star graph, and the same graph missing the edge (1,5).
DeltaCon-Attr considers 5 and 1 top culprits, with 5 ranking higher than 1, as the edge
removal caused a loss of connectivity from node 5 to all the peripheral nodes of the star,
2, 3, 4, and the central node, 1. CAD considers nodes 1 and 5 equally responsible, ignoring
the difference in the connectivity of the components (violation of P2). Similar results are
observed in the comparisons between the larger star graphs: S100, mS100, m3S100, wS100.

• Custom18-m2Custom18: The ranking of node culprits that DeltaCon-Attr finds is 11,
10, 18, and 17. The nodes 11 and 10 are considered more important than the nodes 18 and
17, as the edge removal (10,11) creates a large connected component and a small chain of
4 nodes, while the edge removal (17,18) leads to a single isolated node (18). Node 10 is
higher in the culprit list than node 11 because it loses connectivity to a denser component.
The reasoning is similar for the ranking of nodes 18 and 17. CAD does not consider the
differences in the density of the components, and leads to a different ranking of the nodes.

• Custom18-m2Custom18: The ranking of node culprits that DeltaCon-Attr returns is 5,
6, 18, and 17. This is in agreement with properties P1 and P3, since the edge (5,6) is
more important than the edge (17,18). Node 5 is more responsible than node 6 for the
difference between the two graphs, as node 5 ends up having reduced connectivity to a
denser component. This property is ignored by CAD, which thus returns a different node
ranking.

As we observe, in all the synthetic and easily controlled examples, the ranking of the culprit
nodes and edges that DeltaCon-Attr finds agrees with intuition.

Classification Accuracy To further evaluate the accuracy of DeltaCon-Attr in classifying
nodes as culprits, we perform a simulation-based experiment and compare our method to CAD.
Specifically, we set up a simulation similar to the one that was introduced in [  4].

We sample 2000 points from a 2-dimensional Gaussian mixture distribution with four
components, and construct the matrix $P \in \mathbb{R}^{2000 \times 2000}$ with entries $p(i,j) = \exp(-\|i - j\|)$ for each pair
of points (i, j). Intuitively, the adjacency matrix P corresponds to a graph with four clusters that
have strong connections within them, but weaker connections across them. By following the same
process and adding some noise in each component of the mixture model, we also build a matrix Q,
and add more noise to it, which is defined as:

$$R_{ij} = \begin{cases} 0 & \text{with probability } 0.95 \\ u_{ij} \sim U(0,1) & \text{otherwise,} \end{cases}$$

where U(0,1) is the uniform distribution in (0,1). Then, we compare the two graphs, G_A and G_B,
to each other, which have adjacency matrices A = P and B = Q + (R + R')/2, respectively. We
consider culprits (or anomalous) the inter-cluster edges for which R_ij ≠ 0, and the adjacent nodes.
According to property P1, these edges are considered important (major culprits) for the difference
between the graphs, as they establish more connections between loosely coupled clusters.
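The construction above can be sketched in a few lines. This is a scaled-down illustration under several assumptions that the text does not specify: the component means and standard deviation of the mixture, the way noise is added to obtain Q (here, a small Gaussian jitter of the sampled points), the decaying affinity exp(-distance), and the reading of "R_ij ≠ 0" as "the symmetrized noise on the pair is nonzero". The function name `simulate` and all parameter values are hypothetical.

```python
import math
import random


def simulate(n=200, jitter=0.05, seed=0):
    """Build adjacency matrices A = P and B = Q + (R + R')/2 for n sampled points."""
    rng = random.Random(seed)
    means = [(0, 0), (0, 5), (5, 0), (5, 5)]  # assumed means of the four components
    labels = [rng.randrange(4) for _ in range(n)]
    pts = [(rng.gauss(means[c][0], 0.5), rng.gauss(means[c][1], 0.5)) for c in labels]
    # Assumed noise model for Q: the same points, slightly perturbed.
    pts_q = [(x + rng.gauss(0, jitter), y + rng.gauss(0, jitter)) for x, y in pts]

    def affinity(p):
        # Entry (i, j) decays with the Euclidean distance between points i and j.
        return [[math.exp(-math.dist(p[i], p[j])) for j in range(n)] for i in range(n)]

    A = affinity(pts)
    Q = affinity(pts_q)
    # R_ij is 0 with probability 0.95 and uniform in (0, 1) otherwise.
    R = [[rng.random() if rng.random() >= 0.95 else 0.0 for _ in range(n)]
         for _ in range(n)]
    B = [[Q[i][j] + (R[i][j] + R[j][i]) / 2 for j in range(n)] for i in range(n)]
    # Ground-truth culprit edges: inter-cluster pairs that received extra noise.
    culprits = {(i, j) for i in range(n) for j in range(i + 1, n)
                if labels[i] != labels[j] and (R[i][j] != 0 or R[j][i] != 0)}
    return A, B, labels, culprits


A, B, labels, culprits = simulate(n=50, seed=1)
print(len(culprits), "inter-cluster culprit edges")
```

The returned culprit set can serve as ground truth when scoring the edges that an attribution method reports.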

Conceptually,  DeltaCon-Attr  and  CAD  are  similar  because  they  are  based  on  related  meth¬ 
ods  [KKK 1  .  ]  (Belief  Propagation  and  Random  Walk  with  Restarts,  respectively).  As  shown  in 
Figure  7.7,  the  simulation  described  above  corroborates  this  argument,  and  the  two  methods  have 
comparable  performance  -  i.e.,  the  areas  under  the  ROC  curves  are  similar  for  various  realizations 
of  the  data  described  above.  Over  15  trials,  the  AUC  of  DeltaCon-Attr  and  CAD  is  0.9922  and 
0.9934,  respectively. 
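For readers who want to reproduce this kind of comparison, the area under the ROC curve can be computed from culprit scores and ground-truth labels as the probability that a randomly chosen culprit is scored above a randomly chosen non-culprit. The sketch below uses this pairwise formulation with ties counted as one half; the function name and the example scores are illustrative and are not taken from the experiments.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparisons (ties count as 0.5).

    scores: anomaly scores, higher means more likely a culprit.
    labels: truthy for true culprits, falsy otherwise (both classes must be present).
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical scores: three of the four culprit/non-culprit pairs are ordered correctly.
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]))
```

The quadratic pairwise loop is adequate for small examples; a sort-based rank formula would be preferable for large graphs.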

Observation 17. Both methods are very accurate in detecting nodes that are responsible for
the differences between two highly-clustered graphs (Property P1).


All  in  all,  both  DeltaCon-Attr  and  CAD  have  very  high  accuracy  in  detecting  culprit  nodes  and 
edges  that  explain  the  differences  between  two  input  graphs.  DeltaCon-Attr  satisfies  all  the 
desired  properties  that  define  the  importance  of  edges,  while  CAD  sometimes  fails  to  return  the 
expected  ranked  lists  of  culprits. 

7.4.3  Scalability  of  DeltaCon 

In Section 7.1 we demonstrated that DeltaCon is linear in the number of edges, and here we
show that this also holds in practice. We ran DeltaCon on Kronecker graphs (Table 7.6), which
are known [LCKF0 ] to share many properties with real graphs.

Observation  18.  As  shown  in  Figure  7.8,  DeltaCon  scales  linearly  with  the  number  of  edges  in 
the  largest  input  graph. 

We note that the algorithm can be trivially parallelized by finding the node affinity scores of
the two graphs in parallel instead of sequentially. Moreover, for each graph, the computation
of the similarity scores of the nodes to each of the g groups can be parallelized. However, the
runtimes of our experiments refer to the sequential implementation. The amount of time taken
for DeltaCon-Attr is trivial even for large graphs, given that the necessary affinity matrices
are already in memory from the DeltaCon similarity computation. Specifically, node and edge
attribution are log-linear in the number of nodes and edges, respectively, given that sorting is
unavoidable for the task of ranking.

Figure 7.7: DeltaCon-Attr ties the state-of-the-art method with respect to accuracy. Each plot shows
the ROC curves for DeltaCon-Attr and CAD for different realizations of two synthetic graphs. The
graphs are generated from points sampled from a 2-dimensional Gaussian mixture distribution with
four components.


Figure 7.8: DeltaCon is linear in the number of edges (time in seconds vs. number of edges). The
exact number of edges is annotated.


7.4.4  Robustness  of  DeltaCon 

DeltaCon₀ satisfies all the desired properties, but its runtime is quadratic and does not scale well
to large graphs with more than a few million edges. On the other hand, our second proposed
algorithm, DeltaCon, is scalable both in theory and practice (Lemma 3, Section 7.4.3). In this
section we present the sensitivity of DeltaCon to the number of groups g, as well as how the
similarity scores of DeltaCon and DeltaCon₀ compare.

For  this  experiment,  we  use  complete  and  star  graphs,  as  well  as  the  Political  Blogs  dataset.  For 
each  of  the  synthetic  graphs  (a  complete  graph  with  100  nodes  and  star  graph  with  100  nodes), 
we  create  three  corrupted  versions  where  we  remove  1,  3  and  10  edges,  respectively.  For  the 
real  dataset,  we  create  four  corrupted  versions  of  the  Political  Blogs  graph  by  removing  {10%, 
20%, 40%, 80%} of the edges. For each pair of <original, corrupted> graphs, we compute the
DeltaCon similarity for varying number of groups. We note that when g = n, DeltaCon is
equivalent to DeltaCon₀. The results for the synthetic and real graphs are shown in Figures 7.9(a)
and  7.9(b),  respectively. 

Observation 19. In our experiments, DeltaCon₀ and DeltaCon agree on the ordering of
pairwise graph similarities.

(a) Robustness of method on synthetic graphs.

(b) Robustness of method on the Political Blogs dataset.

Figure 7.9: DeltaCon is robust to the number of groups. More importantly, at every group level,
the ordering of similarities of the different graph pairs remains the same (e.g., sim(K100, m1K100) >
sim(K100, m3K100) > ... > sim(S100, m1S100) > ... > sim(S100, m10S100)).


In Figure 7.9(b) the lines not only do not cross, but are almost parallel to each other. This means
that, as the number of groups g varies, the differences between the similarities of the different graph
pairs remain roughly the same. In particular, the ordering of similarities is the same at every group level g.

Observation  20.  The  similarity  scores  of  DeltaCon  are  robust  to  the  number  of  groups,  g. 

Obviously, the bigger the number of groups, the closer the similarity scores are to the “ideal”
similarity scores, i.e., the scores of DeltaCon₀. For instance, in Figure 7.9(b), when each blog is in its
own group (g = n = 1490), the similarity scores between the original network and the derived
networks (with any level of corruption) are identical to the scores of DeltaCon₀. However, even
with few groups, the approximation of DeltaCon₀ is good.

It is worth mentioning that the bigger the number of groups g, the longer the runtime,
since the complexity of the algorithm is O(g · max{m1, m2}). Not only the accuracy, but also the
runtime increases with the number of groups; so, the trade-off between speed and accuracy needs to
be balanced. Experimentally, a good compromise is achieved even for g smaller than 100.


7.5  DeltaCon  &  DeltaCon-Attr  at  Work 

In  this  section  we  present  three  applications  of  our  graph  similarity  algorithms,  one  of  which 
comes  from  social  networks  and  the  other  two  from  the  area  of  neuroscience. 

7.5.1  Enron 

Graph Similarity First, we employ DeltaCon to analyze the ENRON dataset, which consists of
emails sent among employees in a span of more than two years. Figure 7.10 depicts the DeltaCon
similarity scores between consecutive daily who-emailed-whom graphs. By applying Quality
Control with Individual Moving Range, we obtain the lower and upper limits of the in-control
similarity scores. These limits correspond to median ± 3σ⁸. Using this method, we were able to
define the threshold (lower control limit) below which the corresponding days are anomalous,
i.e., they differ “too much” from the previous and following days. Note that all the anomalous
days relate to crucial events in the company’s history in 2001 and early 2002 (points marked with
red boxes in Figure 7.10):

1.  May  22nd,  2001:  Jordan  Mintz  sends  a  memorandum  to  Jeffrey  Skilling  (CEO  for  a  few 
months)  for  his  sign-off  on  LJM  paperwork. 

2.  August  21st,  2001:  Kenneth  Lay,  the  CEO  of  Enron,  emails  all  employees  stating  he  wants 
“to  restore  investor  confidence  in  Enron.”; 

3.  September  26th,  2001:  Lay  tells  employees  that  the  accounting  practices  are  “legal  and 
totally  appropriate”,  and  that  the  stock  is  “an  incredible  bargain.”; 

4.  October  5th,  2001:  Just  before  Arthur  Andersen  hired  Davis  Polk  &  Wardwell  law  firm  to 
prepare  a  defense  for  the  company; 

5.  October  24-25th,  2001:  Jeff  McMahon  takes  over  as  CFO.  Email  to  all  employees  states  that 
all  the  pertinent  documents  should  be  preserved; 

6.  November  8th,  2001:  Enron  announces  it  overstated  profits  by  586  million  dollars  over  five 
years. 

7.  February  4th,  2002:  Lay  resigns  from  board. 
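The control limits used to flag these days can be sketched as follows. The sketch centers the limits at the median and estimates σ from the mean moving range, as stated in the text; dividing by the constant d2 = 1.128 is the standard choice for moving ranges of two consecutive observations and is an assumption here, since the exact constant used in the analysis is not stated. Function names and the example series are illustrative.

```python
from statistics import mean, median


def imr_limits(sims, d2=1.128):
    """Lower and upper control limits: median ± 3σ, with σ from the mean moving range."""
    moving_ranges = [abs(b - a) for a, b in zip(sims, sims[1:])]
    sigma = mean(moving_ranges) / d2  # assumed standard I-MR estimate of σ
    center = median(sims)
    return center - 3 * sigma, center + 3 * sigma


def anomalous_days(sims):
    """Indices of similarity scores that fall below the lower control limit."""
    lcl, _ = imr_limits(sims)
    return [t for t, s in enumerate(sims) if s < lcl]


# Hypothetical series of daily similarities with one sharp drop at index 10.
sims = [0.9] * 20
sims[10] = 0.2
print(anomalous_days(sims))
```

Only scores below the lower limit are reported, since unusually high similarity between consecutive days is not treated as an anomaly.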

Although high similarities between consecutive days do not constitute anomalies, we found that
mostly weekends exhibit high similarities. For instance, the first two points of 100% similarity
correspond to the weekend before Christmas in 2000 and a weekend in July, when only two
employees sent emails to each other. It is noticeable that after February 2002, many consecutive
days are very similar; this happens because, after the collapse of Enron, the email exchange activity
was rather low and often between certain employees.

Attribution We additionally apply DeltaCon-Attr to the ENRON dataset for the months of
May 2001 and February 2002, which are the most anomalous months according to the analysis of
the data on a month-to-month timescale. Based on the node and edge rankings produced by
our method, we drew some interesting real-world conclusions.

May  2001: 

•  Top  Influential  Culprit:  John  Lavorato,  the  former  head  of  Enron’s  trading  operations  and 
CEO  of  Enron  America,  connected  to  ~50  new  nodes  in  this  month. 

•  Second  Most  Influential  Culprit:  Andy  Zipper,  VP  of  Enron  Online,  maintained  contact  with 
all  those  from  the  previous  month,  but  also  connected  to  12  new  people. 

⁸ The median is used instead of the mean, since appropriate hypothesis tests demonstrate that the data
do not follow the normal distribution. The mean of the moving range is used to estimate σ.




[Figure 7.10 plot: “ENRON: Similarities between consecutive days and events”. Legend: similarity scores, LCL (Lower Control Limit), UCL (Upper Control Limit), mean of sim. scores. Annotated events include: May 22, 2001: Jordan Mintz sends a memorandum to Jeffrey Skilling for his sign-off on LJM paperwork; Feb 4, 2002: Lay resigns from the board; February 7, 2002: Skilling, Fastow, Michael Kopper appear at Congress with McMahon and in-house Enron lawyer Jordan Mintz.]

Figure 7.10: DeltaCon detects anomalies on the Enron data coinciding with major events. The
marked days correspond to anomalies. The blue points are similarity scores between consecutive
instances of the daily email activity between the employees, and the marked days are 3σ units away
from the median similarity score.


•  Third  Most  Influential  Culprit:  Louise  Kitchen,  another  employee  (President  of  ENRON 
Online)  lost  5-6  connections  and  made  5-6  connections.  Most  likely,  some  of  the  connections 
she  lost  or  made  were  very  significant  in  terms  of  expanding/reducing  her  office  network. 

February  2002: 

• Top Influential Culprit: Liz Taylor lost 51 connections this month, but made no new ones;
it is reasonable to assume that she likely quit the position or was fired.

•  Second  Most  Influential  Culprit:  Louise  Kitchen  (third  culprit  in  May  2001)  made  no  new 
connections,  but  lost  22  existing  ones. 

•  Third  Most  Influential  Culprit:  Stan  Horton  (CEO  of  Enron  Transportation)  made  6  new 
connections  and  lost  none.  Some  of  these  connections  are  likely  significant  in  terms  of 
expanding  his  office  network. 

• Fourth, Fifth and Sixth Most Influential Culprits: Employees Kam Keiser, Mike Grigsby (former
VP for Enron’s Energy Services) and Fletcher Sturm (VP) all lost many connections and
made no new ones. Their situations were likely similar to those of Liz Taylor and Louise
Kitchen.

7.5.2 Brain Connectivity Graph Clustering

We also use DeltaCon for the clustering and classification of graphs. For this purpose we study
connectomes, i.e., brain graphs, which are obtained by Multimodal Magnetic Resonance Imaging
[GBV+12].

In  total,  we  study  the  connectomes  of  114  people  which  are  related  to  attributes  such  as 
age,  gender,  IQ,  etc.  Each  brain  graph  consists  of  70  cortical  regions  (nodes),  and  connections 
(weighted  edges)  between  them  (see  Table  7.6  “Brain  Graphs  Small”).  We  ignore  the  strength  of 
connections  and  derive  one  undirected,  unweighted  brain  graph  per  person. 

We  first  compute  the  DeltaCon  pairwise  similarities  between  the  brain  graphs,  and  then 
perform  hierarchical  clustering  using  Ward’s  method  (Figure  7.1(b)).  As  shown  in  the  figure,  there 
are  two  clearly  separable  groups  of  brain  graphs.  Applying  a  t-test  on  the  available  attributes  for 
the  two  groups  created  by  the  clusters,  we  have  found  that  the  latter  differ  significantly  (p  <  .01) 
in  the  Composite  Creativity  Index  (CCI),  which  is  related  to  the  person’s  performance  on  a  series 
of  creativity  tasks.  Figure  7.11  illustrates  the  brain  connections  in  a  subject  with  high  and  low 
creativity  index.  It  appears  that  more  creative  subjects  have  more  and  heavier  connections  across 
their  hemispheres  than  those  subjects  that  are  less  creative.  Moreover,  the  two  groups  correspond 
to  significantly  different  openness  index  (p  =  .0558),  one  of  the  “Big  Five  Factors”;  that  is,  the  brain 
connectivity  is  different  in  people  that  are  inventive  and  people  that  are  consistent.  Exploiting 
analysis  of  variance  (ANOVA:  generalization  of  the  t-test  when  more  than  2  groups  are  analyzed), 
we  tested  whether  the  various  clusters  that  we  obtain  from  the  connectivity-based  hierarchical 
clustering map to differences in other attributes. However, in the dataset we studied there is not
sufficient statistical evidence that age, gender, IQ, etc. are related to brain connectivity.


(a) Brain graph of subject with high creativity index.

(b) Brain graph of subject with low creativity index.

Figure 7.11: Illustration of brain graphs for subjects with high and low creativity index, respectively.
The low-CCI brain has fewer and lighter cross-hemisphere connections than the high-CCI brain.

Figure 7.12: DeltaCon outperforms almost all the baseline approaches. It correctly recovers all
pairs of connectomes that correspond to the same subject (100% accuracy) for weighted graphs,
outperforming all the baselines. It also recovers almost all the pairs correctly in the case of unweighted
graphs, following VEO, which has the best accuracy.


7.5.3  Test-Retest:  Big  Brain  Connectomes. 

We also applied our method on the KKI-42 dataset [RKM+13, OCP ], which consists of
connectomes corresponding to n = 21 subjects. Each subject underwent two Functional MRI scans
at different times, and so the dataset has 2n = 42 large connectomes with ~17 million voxels
and 49.3 to 90.4 million connections among them (see Table 7.6 “Brain Graphs Big”). Our goal
is to recover the pairs of connectomes that correspond to the same subject, by relying only on
the structures of the brain graphs. In the following analysis, we compare our method to the
standard approach in neuroscience literature [RKM 1 ], the Euclidean distance (as induced by
the Frobenius norm), and also to the baseline approaches we introduced in Section 7.4.

We ran the following experiments on a 32-core Intel(R) Xeon(R) CPU E7-8837 at 2.67GHz,
with 1TB of RAM. The signature similarity method runs out of memory for these large graphs, and
we could not use it for this application. Moreover, the variants of λ-distance are computationally
expensive, even when we compute only a few top eigenvalues, and they also perform very poorly
in this task.

Unweighted graphs. The brain graphs that were obtained from the FMRI scans have weighted
edges that correspond to the strength of the connections between different voxels. The weights
tend to be noisy, so we initially ignore them, and treat the brain graphs as binary by mapping
all the non-zero weights to 1. To discover the pairs of connectomes that correspond to the same
subject, we first find the DeltaCon pairwise similarities between the connectomes. We note that it
suffices to do $\binom{2n}{2} = 861$ graph comparisons, since the DeltaCon similarities are symmetric. Then,
we find the potential pairs of connectomes that belong to the same subject by using the following
approach: For each connectome $C_i$, $i \in \{1, \ldots, 2n\}$, we choose the connectome $C_j$, $j \in \{1, \ldots, 2n\} \setminus \{i\}$,
such that the similarity score, sim(C_i, C_j), is maximized. In other words, we pair each connectome
with its most similar graph (excluding itself) as defined by DeltaCon. This results in 97.62%
accuracy of predicting the connectomes of the same subject.

[Figure 7.13 panels: spy plots; axis label: Source voxels.]

(a) The test brain graph of a 32-year-old male.

(b) The true re-test brain graph of the 32-year-old male in (a).

(c) The recovered re-test brain graph by the Euclidean distance.


Figure 7.13: DeltaCon outperforms the Euclidean distance in recovering the correct test-retest pairs
of brain graphs. We depict the spy plots of three brain graphs, across which the order of nodes is
the same and corresponds to increasing order of degrees for the leftmost spy plot. The correct
test-retest pair (a)-(b) that DeltaCon recovers consists of “visually” more similar brain graphs than the
incorrect test-retest pair (a)-(c) that the Euclidean distance found.
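The pairing rule for the unweighted connectomes, in which each connectome is matched to its most similar other connectome, can be sketched as below. The sketch assumes a precomputed symmetric similarity matrix and a list of subject identifiers; the function name and the toy values are hypothetical, and accuracy is taken to be the fraction of connectomes whose best match belongs to the same subject, which is an assumption about how the reported percentages are computed.

```python
def pairing_accuracy(sim, subject):
    """Fraction of graphs whose most similar other graph comes from the same subject.

    sim: square matrix (list of lists) of pairwise similarities.
    subject: subject identifier of each graph, in the same order as the rows of sim.
    """
    n = len(sim)
    correct = 0
    for i in range(n):
        # Most similar graph to i, excluding i itself.
        best = max((j for j in range(n) if j != i), key=lambda j: sim[i][j])
        correct += subject[best] == subject[i]
    return correct / n


# Toy example: two subjects with two scans each.
sim = [[1.0, 0.9, 0.1, 0.2],
       [0.9, 1.0, 0.3, 0.1],
       [0.1, 0.3, 1.0, 0.8],
       [0.2, 0.1, 0.8, 1.0]]
print(pairing_accuracy(sim, [0, 0, 1, 1]))
```

Under this reading, with 2n = 42 connectomes there are 42 · 41 / 2 = 861 unordered pairs to compare, and an accuracy of 97.62% is consistent with 41 of the 42 connectomes being matched correctly.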


In addition to our method, we compute the pairwise Euclidean distances (ED) between the
connectomes and evaluate the predictive power of ED. Specifically, we compute the quantities
$ED(i,j) = \|C_i - C_j\|_F$, where $C_i$ and $C_j$ are the binary adjacency matrices of the connectomes i
and j, respectively, and $\|\cdot\|_F$ is the Frobenius norm of the enclosed matrix. As before, for each
connectome i ∈ {1, ..., 2n}, we choose the connectome j ∈ {1, ..., 2n} \ {i} such that ED(i, j) is
minimized⁹. The accuracy of recovering the pairs of connectomes that correspond to the same
subject is 92.86% (vs. 97.62% for DeltaCon), as shown in Figure 7.12.

⁹ We note that DeltaCon computes the similarity between two graphs, while ED computes their distance.
Thus, when trying to find the pairs of connectomes that belong to the same subject, we want to maximize
the similarity, or equivalently, minimize the distance.

Finally, among the baseline approaches, the Vertex/Edge Overlap similarity performs slightly
better than our method, while GED has the same accuracy as the Euclidean distance (Figure 7.12,
‘Unweighted Graphs’). All the variants of λ-distance perform very poorly, with 2.38% accuracy. As
mentioned before, the signature similarity runs out of memory and, hence, could not be used for
this application.

Weighted  graphs.  We  also  wanted  to  see  how  accurately  the  methods  can  recover  the  pairs  of 
weighted  connectomes  that  belong  to  the  same  subject.  Given  that  the  weights  are  noisy,  we  follow 
the  common  practice  and  first  smooth  them  by  applying  the  logarithm  function  (with  base  10). 
Then  we  follow  the  procedure  described  above  both  for  DeltaCon  and  the  Euclidean  distance. 
Our  method  yields  100%  accuracy  in  recovering  the  pairs,  while  the  Euclidean  distance  results 
in 92.86% accuracy. The results are shown in Figure 7.12 (labeled ‘Weighted Graphs’), while
Figure  7.13  shows  a  case  of  brain  graphs  that  were  incorrectly  recovered  by  the  ED-based  method, 
but  successfully  found  by  DeltaCon. 
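The log-smoothing step mentioned above can be sketched as follows. The text only states that the base-10 logarithm is applied to the weights; how zero entries (non-edges) are treated is not specified, so this sketch assumes they are left at zero. The function name `log_smooth` is hypothetical.

```python
import numpy as np

def log_smooth(weights):
    """Apply log10 to the positive entries; zero entries (non-edges) stay zero.

    Assumption: weights are non-negative. Note that a weight of exactly 1 maps to 0.
    """
    w = np.asarray(weights, dtype=float)
    out = np.zeros_like(w)
    mask = w > 0
    out[mask] = np.log10(w[mask])
    return out
```

Because a weight of 1 becomes indistinguishable from a non-edge under this convention, some pipelines use log10(1 + w) instead; whether that variant was used here is not stated in the text.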

In  the  case  of  weighted  graphs,  as  shown  in  Figure  7.12,  all  the  baseline  approaches  perform 
worse  than  DeltaCon  in  recovering  the  correct  brain  graph  pairs.  In  the  plot  we  include  the 
methods that have comparable performance to our method. The λ-distance has the same very
poor accuracy (2.38% for all the variants) as in the case of unweighted graphs, while the signature
similarity could not be applied due to its very high memory requirements.

Therefore,  by  using  DeltaCon  we  are  able  to  recover,  with  almost  perfect  accuracy,  which 
large  connectomes  belong  to  the  same  subject.  On  the  other  hand,  the  commonly  used  technique, 
the  Euclidean  distance,  as  well  as  the  baseline  methods  fail  to  detect  several  connectome  pairs 
(with the exception of VEO in the case of unweighted graphs).


7.6  Related  Work 

The  related  work  comprises  three  main  areas:  Graph  similarity,  node  affinity  algorithms,  and 
temporal  anomaly  detection  with  node  attribution.  We  give  the  related  work  in  each  area  separately, 
and  mention  what  sets  our  method  apart. 

Graph Similarity. Graph similarity refers to the problem of quantifying how similar two graphs
are. The graphs may have the same or different sets of nodes, and the existing approaches can be
divided into two main categories:

(1)  With  Known  Node  Correspondence.  The  first  category  assumes  that  the  two  given  graphs  are 
aligned,  or,  in  other  words,  the  node  correspondence  between  the  two  graphs  is  given.  [  MOf  ] 
proposes  five  similarity  measures  for  directed  web  graphs,  which  are  applied  for  anomaly  detection. 
Among  them  the  best  is  Signature  Similarity  (SS),  which  is  based  on  the  SimHash  algorithm, 
while Vertex/Edge Overlap similarity (VEO) also performs very well. Bunke [BDKW06] presents
techniques  used  to  track  sudden  changes  in  communications  networks  for  performance  monitoring. 
The  best  approaches  are  the  Graph  Edit  Distance  and  Maximum  Common  Subgraph.  Both  are 
NP-complete, but the former approach can be simplified given the application, in which case it
becomes linear in the number of nodes and edges in the graphs. This chapter attacks the graph
similarity problem with known node correspondence, and is an extension of the work in [ 13], where
DeltaCon was first introduced. In addition to computational methods to assess the similarity
between graphs, there is also a line of work on visualization-based graph comparison. These
techniques are based on the side-by-side visualization of the two networks [AWW09, HD12], or
superimposed/augmented graph or matrix views [ABHR+13, HK7 03]. A review of visualization-based
comparison of information based on these and additional techniques is given in [ ]. [ 1 13]
investigate ways of visualizing differences between small brain graphs using either augmented graph
representations or augmented adjacency matrices. However, their approach works for small and
sparse graphs (40-80 nodes). Honeycomb [ D0( ] is a matrix-based visualization tool that
handles larger graphs with several thousand edges, and performs temporal analysis of a graph by
showing the time series of graph properties that are of interest. The visualization methods do not
compute the similarity score between two graphs, but only show their differences. This is related
to the culprit nodes and edges that our method DeltaCon finds. However, these methods tend
to visualize all the differences between two graphs, while our algorithm routes attention to the
nodes and edges that are mostly responsible for the differences among the input graphs. All in all,
visualizing and comparing graphs with millions or billions of nodes and edges remains a challenge,
and visualization-based comparison best suits small problems.

(2)  With  Unknown  Node  Correspondence.  The  previous  works  assume  that  the  correspondence  of 
nodes  across  the  two  graphs  is  known,  but  this  is  not  always  the  case.  Social  network  analysis, 
bioinformatics,  and  pattern  recognition  are  just  a  few  domains  with  applications  where  the  node 
correspondence  information  is  missing  or  even  constitutes  the  objective.  The  works  attacking 
this  problem  can  be  divided  into  three  main  approaches:  (a)  feature  extraction  and  similarity 
computation  based  on  the  feature  space,  (b)  graph  matching  and  the  application  of  techniques 
from  the  first  category,  and  (c)  graph  kernels. 

There are numerous works that follow the first approach and use features to define the
similarity between graphs. The λ-distance, a spectral method which defines the distance between
two graphs as the distance between their spectra (eigenvalues), has been studied thoroughly
([BDKW06, Pea03, WZ08], algebraic connectivity [ ie73]). The existence of co-spectral graphs
with different structure, and the big differences in the graph spectra despite subtle changes in
the graphs, are two weaknesses that come on top of the high complexity of computing the whole
graph spectrum. Clearly, the spectral methods that call for the whole spectrum cannot scale
to the large-scale graphs with billions of nodes and edges that are of interest currently. Also,
depending on the graph-related matrix that is considered (adjacency, Laplacian, normalized
Laplacian), the distance between the graphs is different. As we show in Section 7.4, these methods
fail to satisfy one or more of the desired properties for graph comparison. [LSYZ1 ] proposes an
SVM-based approach on some global feature vectors (including the average degree, eccentricity,
number of nodes and edges, number of eigenvalues, and more) of the graphs in order to perform
graph classification. Macindoe and Richards [MRU ] focus on social networks and extract three
socially relevant features: leadership, bonding and diversity. The complexity of the last feature
makes the method applicable to graphs with up to a few million edges. Other techniques include
computing edge curvatures under heat kernel embedding [ H08], comparing the number of
spanning trees [ Cel76], comparing graph signatures consisting of summarized local structural
features [ KERF13], and a distance based on graphlet correlation [YMDD 14].
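To make the spectral (λ-distance) discussion above concrete, the following sketch computes the Euclidean distance between the sorted spectra of two undirected graphs for each of the three graph-related matrices. It is an illustration under assumptions: the function names are hypothetical, the graphs are small dense symmetric matrices, and the distance is taken over the k largest eigenvalues (all of them by default).

```python
import numpy as np

def graph_matrix(adj, kind="adjacency"):
    """Return the adjacency, Laplacian, or normalized Laplacian of an undirected graph."""
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1)
    if kind == "adjacency":
        return adj
    lap = np.diag(deg) - adj
    if kind == "laplacian":
        return lap
    if kind == "normalized_laplacian":
        inv_sqrt = np.zeros_like(deg)
        nonzero = deg > 0
        inv_sqrt[nonzero] = 1.0 / np.sqrt(deg[nonzero])
        # D^{-1/2} L D^{-1/2}; isolated nodes contribute zero rows and columns.
        return inv_sqrt[:, None] * lap * inv_sqrt[None, :]
    raise ValueError(f"unknown matrix kind: {kind}")

def lambda_distance(adj1, adj2, kind="adjacency", k=None):
    """Euclidean distance between the k largest eigenvalues of the two graphs."""
    s1 = np.sort(np.linalg.eigvalsh(graph_matrix(adj1, kind)))[::-1]
    s2 = np.sort(np.linalg.eigvalsh(graph_matrix(adj2, kind)))[::-1]
    if k is None:
        k = min(len(s1), len(s2))
    return float(np.sqrt(np.sum((s1[:k] - s2[:k]) ** 2)))
```

Both weaknesses named in the text show up on a classic co-spectral pair: the star on five nodes and a 4-cycle plus an isolated node share the adjacency spectrum {2, 0, 0, 0, -2}, so their adjacency-based distance is approximately 0 although the graphs differ, whereas the Laplacian-based distance between the same two graphs is 2.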

The second approach first solves the graph matching or alignment problem, i.e., finds the ‘best’
correspondence between the nodes of the two graphs, and then finds the distance (or similarity)
between the graphs. [ ] reviews graph matching algorithms in pattern recognition. There
are over 150 publications that attempt to solve the graph alignment problem under different
settings and constraints. The methods span from genetic algorithms to decision trees, clustering,
expectation-maximization and more. Some recent methods that are more efficient for large
graphs include a distributed, belief-propagation-based method for protein alignment [ ],
another message-passing algorithm for aligning sparse networks when some possible matchings
are given [ ], and a gradient-descent-based method for probabilistically aligning large
bipartite graphs (Chapter 8, [KTLT ]).

The  third  approach  uses  kernels  between  graphs,  which  were  introduced  in  2010  by  [  KB  10]. 
Graph  kernels  work  directly  on  the  graphs  without  doing  feature  extraction.  They  compare  graph 
structures, such as walks [KTI03, GFW03], paths [BK05], cycles [ IGW04], trees [RG03, MV09],
and graphlets [ 1VP 1 09, CD10], which can be computed in polynomial time. A popular graph kernel
is the random walk graph kernel [KTI03, GFW03], which finds the number of common walks on
the two input graphs. The simple version of this kernel is slow, requiring $O(n^6)$ runtime, but can be
sped up to $O(n^3)$ by using the Sylvester equation. In general, the above-mentioned graph kernels
do not scale well to graphs with more than 100 nodes. A faster implementation of the random
walk graph kernel with $O(n^2)$ runtime was proposed by Kang et al. [ TS12]. The fastest kernel to
date is the subtree kernel proposed by Shervashidze and Borgwardt [SB09, SSvL' ], which is
linear in the number of edges and the maximum degree, $O(m \cdot d)$, in the graphs. The proposed
approach uses the Weisfeiler-Lehman test of isomorphism, and operates on labeled graphs. In our
work, we consider large, unlabeled graphs, while most kernels require at least $O(n^3)$ runtime or
labels  on  the  nodes/edges.  Thus,  we  do  not  compare  them  to  DeltaCon  quantitatively. 
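As a rough illustration of why the simple random walk kernel mentioned above costs $O(n^6)$, the sketch below uses the textbook direct-product formulation for unlabeled graphs: common walks of the two graphs correspond to walks on their Kronecker product, and a geometric series over walk lengths is summed by solving one $n^2 \times n^2$ linear system. The function name and the `decay` parameter are assumptions for this example, and this is not the implementation of any of the cited works.

```python
import numpy as np

def random_walk_kernel(adj1, adj2, decay=0.1):
    """Geometric random walk kernel: 1^T (I - decay * W)^{-1} 1, with W = kron(A1, A2)."""
    a1 = np.asarray(adj1, dtype=float)
    a2 = np.asarray(adj2, dtype=float)
    product = np.kron(a1, a2)  # adjacency matrix of the direct product graph
    size = product.shape[0]
    radius = np.max(np.abs(np.linalg.eigvals(product)))
    if decay * radius >= 1.0:
        raise ValueError("decay too large: the geometric series does not converge")
    ones = np.ones(size)
    # Solving this dense (n1*n2) x (n1*n2) system is the cubic-in-n^2 bottleneck.
    return float(ones @ np.linalg.solve(np.eye(size) - decay * product, ones))
```

For two single-edge graphs and `decay=0.5`, every row of the product matrix sums to 1, so the kernel value is 4 / (1 - 0.5) = 8.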

Remark. Both research problems (graph similarity with given or missing node correspondence)
are important, but apply in different settings. If the node correspondence is available, the
algorithms that make use of it can only perform better than the methods that omit it. Our work
tackles the former problem.

Node Affinity. There are numerous node affinity algorithms; PageRank [ 5P98], Personalized
Random Walks with Restarts [ lav03], the electric network analogy [ 884], SimRank [JW02] and
extensions/improvements [YLZ+13], [ 10], and Belief Propagation [YFW03] are only some
examples of the most successful techniques. In this chapter we focus on the last method, Belief
Propagation, and specifically a fast variation that we introduced in Chapter 4 ([KKK+11]). All the
techniques have been used successfully in many tasks, such as ranking, classification, malware and
fraud detection ([CNW+ l],[MBA+09]), and recommendation systems [ ESI ].

Anomaly Detection. Anomaly detection in static graphs has been studied using various data
mining and statistical techniques [ATK14, KC10, ACK+12, LKKF13, KLKF14]. Detection of
anomalous behaviors in time-evolving networks is more relevant to our work, and is covered in the
surveys [ , ]. A non-exhaustive list of works on temporal graph anomaly detection
follows. [MGF1 ], [KPF1 ] and [MWP+14] employ tensor decomposition to identify anomalous
substructures in graph data in the context of intrusion detection. Henderson et al. propose a
multi-level approach for identifying anomalous behaviors in volatile temporal graphs based on
iteratively pruning the temporal space using multiple graph metrics [ 1C]. CopyCatch
[ ] is a clustering-based MapReduce approach to identify lockstep behavior in "Page Like"
patterns on Facebook. Akoglu and Faloutsos use local features and the node eigen-behaviors to
detect points of change, i.e., times when many of the nodes behave differently, and also spot the
nodes that are most responsible for the change point [AF10]. Finally, [ ID1 ] monitors changes in the
commute time between all pairs of nodes to detect anomalous nodes and edges in time-evolving
networks. All these works use various approaches to detect anomalous behaviors in dynamic
graphs, though they are not based on the similarity between graphs, which is the focus of our work.

Node/Edge Attribution. Some of the anomaly detection methods discover anomalous nodes
and other anomalous structures in the graphs. In a slightly different context, a number of
techniques have been proposed for assessing node and edge importance in graphs. PageRank,
HITS [ le9 ] and betweenness centrality (random-walk-based [New05] and shortest-path-based
[ re77]) are several such methods for identifying important nodes. [TPER+12]
proposes a method to determine edge importance for the purpose of augmenting or inhibiting
dissemination of information between nodes. To the best of our knowledge, this and other existing
methods focus only on identifying important nodes and edges in the context of a single graph.
In the context of anomaly detection, [AF10] and [SD1 ] detect the nodes that contribute the most
to change events in time-evolving networks.

Among these works, the most relevant to ours are the methods proposed by [AF10] and
[SD1 ]. The former relies on the selection of features, and tends to return a large number of false
positives. Moreover, because of the focus on local egonet features, it may not distinguish between
small and large changes in time-evolving networks [ ]. At the same time, and independently
from us, Sricharan and Das proposed CAD [ ], a method which defines the anomalousness of
edges based on the commute time between nodes. The commute time between nodes i and j is the
expected number of steps in a random walk starting at i, before node j is visited and then node i is
reached again. This method is closely related to DeltaCon, as Belief Propagation (the heart of our
method) and Random Walks with Restarts (the heart of CAD) are equivalent under certain
conditions [KKK+ ]. However, the methods work in different directions: DeltaCon first identifies
the most anomalous nodes, and then defines the anomalousness of edges as a function of the
outlierness of the adjacent nodes; CAD first identifies the most anomalous edges, and then defines
all their adjacent nodes as anomalous without ranking them. Our method not only finds anomalous
nodes and edges in a graph, but also (i) ranks them in decreasing order of anomalousness (which can
be used for guiding attention to important changes) and (ii) quantifies the difference between two
graphs (which can also be used for graph classification, clustering and other tasks).
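The commute time defined above has a standard closed form for connected, undirected graphs: it equals the graph volume (sum of degrees) times the effective resistance, which can be read off the pseudoinverse of the Laplacian. The sketch below uses that identity; it is not the CAD algorithm itself, the function name is hypothetical, and the dense pseudoinverse is only practical for small graphs.

```python
import numpy as np

def commute_times(adj):
    """All-pairs commute times of a connected, undirected graph.

    Uses c(i, j) = vol(G) * (L+_ii + L+_jj - 2 * L+_ij), where L+ is the
    Moore-Penrose pseudoinverse of the graph Laplacian.
    """
    a = np.asarray(adj, dtype=float)
    deg = a.sum(axis=1)
    lap_pinv = np.linalg.pinv(np.diag(deg) - a)
    diag = np.diag(lap_pinv)
    resistance = diag[:, None] + diag[None, :] - 2.0 * lap_pinv
    return deg.sum() * resistance
```

On the path 0-1-2, for example, this gives a commute time of 4 between the two endpoints of an edge and 8 between the two ends of the path.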


7.7  Summary 

In  this  chapter,  we  have  tackled  the  problem  of  graph  similarity  when  the  node  correspondence  is 
known,  such  as  similarity  in  time-evolving  phone  networks.  Our  contributions  are: 

•  Axioms/Properties:  We  have  formalized  the  problem  of  graph  similarity  by  providing 
axioms,  and  desired  properties. 

•  Effective and Scalable Algorithm: We have proposed DeltaCon, a graph similarity
algorithm that is (a) principled (axioms A1-A3, in Section 7.1), (b) intuitive (properties P1-P4,
in Section 7.4), and (c) scalable, needing ~160 seconds on commodity hardware for a graph
with over 67 million edges. We have also introduced DeltaCon-Attr, a scalable method
enabling node/edge attribution for differences between graphs.

•  Experiments  on  Real  Graphs:  We  have  evaluated  the  intuitiveness  of  DeltaCon,  and 
compared  it  to  six  state-of-the-art  measures  by  using  various  synthetic  and  real,  big  graphs. 
We  have  also  shown  how  to  use  DeltaCon  and  DeltaCon-Attr  for  temporal  anomaly 
detection  (ENRON),  clustering  &  classification  (brain  graphs),  as  well  as  recovery  of  test- 
retest  brain  scans. 

Future  directions  include  extending  DeltaCon  to  handle  streaming  graphs,  incremental  similarity 
updates  for  time-evolving  graphs,  as  well  as  graphs  with  node  and/or  edge  attributes. 
Chapter based on work that appeared at ICDM 2013 [ Lie ].


Chapter  8 

Graph  Alignment 


Can we spot the same people in two different social networks, such as LinkedIn and Facebook?
How can we find similar people across different graphs? How can we effectively link an information
network with a social network to support cross-network search? In all these settings, a key step is to
align¹ the two graphs in order to reveal similarities between the nodes of the two networks. While
in the previous chapter we focused on computing the similarity between two aligned networks,
in this chapter we focus on aligning the nodes of two graphs, when that information is missing.
Informally, the problem we tackle is defined as follows:

Problem  Definition  8.  [Graph  Alignment  or  Matching  -  Informal] 

•  Given: two graphs, $G_A(\mathcal{V}_A, \mathcal{E}_A)$ and $G_B(\mathcal{V}_B, \mathcal{E}_B)$, where $\mathcal{V}$ and $\mathcal{E}$ are their node and edge
sets, respectively

•  Find:  how  to  permute  their  nodes,  so  that  the  graphs  have  as  similar  structure  as  possible. 


Graph alignment is a core building block in many disciplines, as it essentially enables us to link
different networks so that we can search and/or transfer valuable knowledge across them. The
notions of graph similarity and alignment appear in many settings, such as protein-protein
alignment ([BGG+09],[BBM+10]), chemical compound comparison [SHL08], information extraction
for finding synonyms in a single language, or translation between different languages [BGSW13],
similarity query answering in databases [MGMR02], and pattern recognition ([ FSV04],[ 'BV09]).

In this chapter we primarily focus on the alignment of bipartite graphs, i.e., graphs whose
edges connect two disjoint sets of vertices (that is, there are no edges within either of the two node
sets). Bipartite graphs constitute an important class of real graphs and appear in many different settings,
such  as  author-conference  publishing  graphs,  user-group  membership  graphs,  and  user-movie 
rating  graphs.  Despite  their  ubiquity,  most  (if  not  all)  of  the  existing  work  on  graph  alignment  is 
tailored  for  unipartite  graphs  and,  thus,  might  be  sub-optimal  for  bipartite  graphs. 


¹Throughout this work we use the words “align(ment)” and “match(ing)” interchangeably.
Our contributions are:

1.  Problem  Formulation:  We  introduce  a  powerful  primitive  with  new  constraints  for  the 
graph  matching  problem. 

2.  Effective  and  Scalable  Algorithm:  We  propose  an  effective  and  fast  procedure,  BiG-Align, 
to  solve  the  constrained  optimization  problem  with  careful  handling  of  many  subtleties. 
Then,  we  further  generalize  it  for  matching  unipartite  graphs  (Uni-Align). 

3.  Experiments  on  Real  Graphs:  We  conduct  extensive  experiments,  which  demonstrate  that 
our algorithms, BiG-Align and Uni-Align, are superior to existing graph matching algorithms
in terms of both accuracy and efficiency. BiG-Align is up to 10x more accurate and
174x faster than competitive methods on real graphs.

This  chapter  is  organized  as  follows:  Section  8.1  presents  the  formal  definition  of  the  graph 
matching  problem  we  address.  Section  8.2  describes  our  proposed  method,  and  Section  8.4 
presents  our  experimental  results.  Finally,  we  give  the  related  work,  discussion,  and  conclusions  in 
Sections  8.5,  8.6,  and  8.7,  respectively. 

8.1  Proposed  Problem  Formulation 

In  the  past  three  decades,  numerous  communities  have  studied  the  problem  of  graph  alignment, 
as  it  arises  in  many  settings.  However,  most  of  the  research  has  focused  on  unipartite  graphs,  i.e., 
graphs  that  consist  of  only  one  type  of  nodes.  Formally,  the  problem  that  has  been  addressed  in 
the past is the following: Given two unipartite graphs, $G_A$ and $G_B$, with adjacency matrices A and
B, find the permutation matrix P that minimizes the cost function $f_{uni}$:

$$\min_P f_{uni}(P) = \min_P \|PAP^T - B\|_F^2,$$

where $\|\cdot\|_F$ is the Frobenius norm of the corresponding matrix. We list the frequently used symbols
in Table 8.1. The permutation matrix P is a square binary matrix with exactly one entry 1 in each
row and column, and 0s elsewhere. Effectively, it reorders the rows of the adjacency matrix A,
while  its  transpose  reorders  the  columns  of  the  matrix,  so  that  the  resulting  reordered  matrix  is 
“close”  to  B. 
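The role of the permutation matrix in the unipartite cost can be checked numerically. The sketch below is only an illustration: the helper names are hypothetical, and the cost is taken as the squared Frobenius norm of $PAP^T - B$.

```python
import numpy as np

def permutation_matrix(perm):
    """Square binary matrix with a single 1 per row and column: P[i, perm[i]] = 1."""
    n = len(perm)
    p = np.zeros((n, n))
    p[np.arange(n), perm] = 1.0
    return p

def unipartite_cost(p, a, b):
    """Squared Frobenius norm of P A P^T - B."""
    return float(np.sum((p @ a @ p.T - b) ** 2))
```

With this convention, entry (i, j) of $PAP^T$ equals entry (perm[i], perm[j]) of A, so left-multiplication by P reorders the rows and right-multiplication by its transpose reorders the columns. If B is a relabeled copy of A, the matching permutation gives cost 0, while the identity generally does not.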

In  this  work,  we  introduce  the  problem  of  aligning  bipartite  graphs.  One  example  of  such 
graphs  is  the  user-group  graph;  the  first  set  of  nodes  consists  of  users,  the  second  set  of  groups,  and 
the  edges  represent  user  memberships.  Throughout  the  chapter  we  will  consider  the  alignment 
of the “user-group” LinkedIn graph (A) with the “user-group” Facebook graph (B). In a more
general  setting,  the  reader  may  think  of  the  first  set  consisting  of  nodes,  and  the  second  set  of 
communities. 
Table 8.1: Description of major symbols.

Notation                                    Description
------------------------------------------  ----------------------------------------------------------
A, B                                        adjacency matrix of bipartite graph $G_A$, $G_B$
$A^T$, $B^T$                                transpose of matrix A, B
$\mathcal{V}_A$, $\mathcal{V}_B$            set of nodes of A, B
$\mathcal{E}_A$, $\mathcal{E}_B$            set of edges of A, B
$n_{A1}$, $n_{A2}$                          number of nodes of graph A in sets 1 and 2, respectively
$n_{B1}$, $n_{B2}$                          number of nodes of graph B in sets 1 and 2, respectively
P                                           user-level (node-level) correspondence matrix
Q                                           group-level (community-level) correspondence matrix
$P^{(v)}$                                   row or column vector of matrix P
$\mathbf{1}$                                vector of 1s
$\|A\|_F$                                   $= \sqrt{\mathrm{Tr}(A^T A)}$, Frobenius norm of A
$\lambda$, $\mu$                            sparsity penalty parameters for P, Q, respectively
                                            (equivalent to lasso regularization)
$\eta_P$, $\eta_Q$                          step of gradient descent for P, Q
$\epsilon$                                  small constant (> 0) for the convergence of gradient descent


First,  we  extend  the  traditional  unipartite  graph  alignment  problem  definition  to  bipartite 
graphs: 


Problem Definition 9. [Adaptation of traditional definition] Given two bipartite graphs, $G_A$
and $G_B$, with adjacency matrices A and B, find the permutation matrices P and Q that minimize
the cost function $f_0$:

$$\min_{P,Q} f_0(P, Q) = \min_{P,Q} \|PAQ - B\|_F^2,$$

where $\|\cdot\|_F$ is the Frobenius norm of the matrix.

We  note  that  in  this  case  there  are  two  different  permutation  matrices  that  reorder  the  rows 
and  columns  of  A  “independently”.  However,  this  formulation  has  two  main  shortcomings: 

[S1] It is hard to solve, due to its combinatorial nature.

[S2] The permutation matrices imply that we are in search of hard assignments between the nodes
of the input graphs. However, finding hard assignments might be neither possible nor realistic. For
instance,  in  the  case  of  input  graphs  with  a  perfect  ‘star’  structure,  aligning  their  spokes  (peripheral 
nodes)  is  impossible,  as  they  are  identical  from  the  structural  viewpoint.  In  other  words,  any  way 
of  aligning  the  spokes  is  equiprobable.  In  such  cases,  as  well  as  more  complicated  and  realistic 
cases,  soft  assignment  may  be  more  valuable  than  hard  assignment. 
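Both shortcomings can be seen on a toy bipartite graph by brute force. The sketch below enumerates every pair of permutation matrices (P, Q), which is feasible only for tiny inputs, and collects all pairs that attain the minimum cost. The function names are hypothetical, and A and B are assumed to have the same shape so that P and Q are square.

```python
import itertools
import numpy as np

def permutation_matrix(perm):
    """Square binary matrix with P[i, perm[i]] = 1."""
    n = len(perm)
    p = np.zeros((n, n))
    p[np.arange(n), perm] = 1.0
    return p

def bipartite_cost(p, a, q, b):
    """Squared Frobenius norm of P A Q - B."""
    return float(np.sum((p @ a @ q - b) ** 2))

def optimal_hard_alignments(a, b, tol=1e-12):
    """Exhaustive search over all n1! * n2! permutation pairs (toy sizes only)."""
    n1, n2 = a.shape
    best_cost, best = np.inf, []
    for row_perm in itertools.permutations(range(n1)):
        p = permutation_matrix(row_perm)
        for col_perm in itertools.permutations(range(n2)):
            cost = bipartite_cost(p, a, permutation_matrix(col_perm), b)
            if cost < best_cost - tol:
                best_cost, best = cost, [(row_perm, col_perm)]
            elif abs(cost - best_cost) <= tol:
                best.append((row_perm, col_perm))
    return best_cost, best
```

For a graph with two users and three groups in which the first user belongs to all groups and the second only to the third, aligning the graph with itself yields two equally good hard assignments: the first two groups are structurally identical, so swapping them costs nothing. The search space itself grows as n1! * n2!, which illustrates the combinatorial nature noted in [S1].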

To  deal  with  these  issues,  we  relax  Problem  9  that  is  directly  adapted  from  the  well-studied 
case  of  unipartite  graphs,  and  state  it  in  a  more  realistic  way: 
Problem Definition 10. [Soft, Sparse Bipartite Graph Alignment] Given two bipartite graphs,
$G_A$ and $G_B$, with adjacency matrices A and B, find the correspondence matrices P, Q that
minimize the cost function f:

$$\min_{P,Q} f(P, Q) = \min_{P,Q} \|PAQ - B\|_F^2$$

under the following constraints:

(1) [Probabilistic] each matrix element is a probability, i.e., $0 \le P_{ij} \le 1$ and $0 \le Q_{ij} \le 1$, and

(2) [Sparsity] the matrices are sparse, i.e., $\|P^{(v)}\|_0 \le t$ and $\|Q^{(v)}\|_0 \le t$ for some small, positive
constant t. Here $\|\cdot\|_0$ denotes the $\ell_0$-norm of the enclosed vector, i.e., the number of its
non-zero elements.


The first constraint, which allows non-integer entries in the matrices, has two advantages:
[A1] It addresses both shortcomings of the traditional formulation. The optimization problem is
easier to solve, and has a realistic, probabilistic interpretation; it does not provide only the 1-to-1
correspondences, but also reveals the similarities between the nodes across networks. The entries
of the correspondence matrix P (or Q) describe the probability that a LinkedIn user (or group)
corresponds to a Facebook user (or group). We note that these properties are not guaranteed
when the correspondence matrix is required to be a permutation or even doubly stochastic (square
matrix with non-negative real entries, where each row and column sums to 1), which is common
practice in the literature.

[A2]  The  matrices  P  and  Q  do  not  have  to  be  square,  which  means  that  the  matrices  A  and  B  can 
be  of  different  size.  This  is  yet  another  realistic  requirement,  as  very  rarely  do  two  networks  have 
the  same  number  of  nodes.  Therefore,  our  formulation  addresses  not  only  graph  alignment,  but 
also  subgraph  alignment. 


The second constraint follows naturally from the first one, as well as from the large size of social
and other networks. We want the correspondence matrices to be as sparse as possible, so that they
encode few potential correspondences per node. Allowing every user/group of LinkedIn to be
matched to every user/group of Facebook is not realistic and, in fact, is problematic for large
graphs, as it has quadratic space cost with respect to the size of the input graphs.


To  sum  up,  the  existing  approaches  do  not  distinguish  the  nodes  by  types  (e.g.,  users  and 
groups),  treat  the  graphs  as  unipartite,  and,  thus,  aim  at  finding  a  permutation  matrix  P,  which 
gives  a  hard  assignment  between  the  nodes  of  the  input  graphs.  In  contrast,  our  formulation 
separates  the  nodes  in  categories,  and  can  find  correspondences  at  different  granularities  at  once 
(e.g.,  individual  and  group-level  correspondence  in  the  case  of  the  “user-group”  graph). 
8.2  Proposed Method: BiG-Align for Bipartite Graphs

Now  that  we  have  formulated  the  problem,  we  move  on  to  the  description  of  a  technique  to  solve 
it.  Our  objective  is  two-fold:  i)  In  terms  of  effectiveness,  given  the  non-convexity  of  Problem  10, 
our  goal  is  to  find  a  ‘good’  local  minimum;  ii)  In  terms  of  efficiency,  we  focus  on  carefully  designing 
the search procedure. The two key ideas of our method, BiG-Align, are:

•  An  alternating,  projected  gradient  descent  approach  to  find  the  local  minima  of  the  newly- 
defined  optimization  problem  (Problem  10),  and 

•  A series of optimizations: (a) a network-inspired initialization (Net-Init) of the
correspondence matrices to find a good starting point, (b) automatic choice of the steps for the
gradient  descent,  and  (c)  handling  the  node-multiplicity  problem,  i.e.,  the  “problem”  of 
having  nodes  with  exactly  the  same  structure  (e.g.,  peripheral  nodes  of  a  star)  to  improve 
both  effectiveness  and  efficiency. 

Next,  we  start  by  building  the  core  of  our  method,  continue  with  the  description  of  the  three 
optimizations,  and  conclude  with  the  pseudocode  of  the  overall  algorithm. 


8.2.1  Alternating Projected Gradient Descent (APGD): Mathematical Formulation

Following the standard approach in the literature, in order to solve the optimization problem
(Problem 10), we propose to first relax the sparsity constraint, which is mathematically represented
by the $\ell_0$-norm of the matrices’ columns, and replace it with the $\ell_1$-norm,
$\|P^{(v)}\|_1 = \sum_i |P_i^{(v)}| = \sum_i P_i^{(v)}$, where we also use the probabilistic constraint.
Therefore, the sparsity constraint now takes the form: $\sum_{i,j} P_{ij} \le t$ and $\sum_{i,j} Q_{ij} \le t$.
By using this relaxation and applying linear algebra operations, the bipartite graph alignment
problem takes the following form.

Theorem 1.

[Augmented Cost Function] The optimization problem for the alignment of the bipartite
graphs $G_A$ and $G_B$, with adjacency matrices A and B, under the probabilistic and sparsity
constraints (Problem 10), is equivalent to:

$$\begin{aligned}
\min_{P,Q} f_{aug}(P, Q) &= \min_{P,Q} \Big\{ \|PAQ - B\|_F^2 + \lambda \sum_{i,j} P_{ij} + \mu \sum_{i,j} Q_{ij} \Big\} \\
&= \min_{P,Q} \Big\{ \|PAQ\|_F^2 - 2\,\mathrm{Tr}(PAQB^T) + \lambda \mathbf{1}^T P \mathbf{1} + \mu \mathbf{1}^T Q \mathbf{1} \Big\},
\end{aligned} \qquad (8.1)$$

where $\|\cdot\|_F$ is the Frobenius norm of the enclosed matrix, P and Q are the user- and
group-level correspondence matrices, and $\lambda$ and $\mu$ are the sparsity penalties of P and Q,
respectively.
Proof. The minimization

$$\min_{P,Q} \|PAQ - B\|_F^2 \qquad \text{(Problem 10)}$$

can be reduced to

$$\min_{P,Q} \big\{ \|PAQ\|_F^2 - 2\,\mathrm{Tr}(PAQB^T) \big\}.$$

Starting from the definition of the Frobenius norm of $PAQ - B$ (Table 8.1), we obtain:

$$\begin{aligned}
\|PAQ - B\|_F^2 &= \mathrm{Tr}\big((PAQ - B)(PAQ - B)^T\big) \\
&= \mathrm{Tr}\big(PAQ(PAQ)^T - 2PAQB^T\big) + \mathrm{Tr}(BB^T) \\
&= \|PAQ\|_F^2 - 2\,\mathrm{Tr}(PAQB^T) + \mathrm{Tr}(BB^T),
\end{aligned} \qquad (8.2)$$

where we used the property $\mathrm{Tr}(PAQB^T) = \mathrm{Tr}\big((PAQB^T)^T\big)$. Given that the last term,
$\mathrm{Tr}(BB^T)$, does not depend on P or Q, it does not affect the minimization.


In summary, we solve the minimization problem by using a variant of the gradient descent algorithm. Given that the cost function in Equation 8.1 is bivariate, we use an alternating procedure to minimize it. We fix $Q$ and minimize $f_{aug}$ with respect to $P$, and vice versa. If, during the two alternating minimization steps, the entries of the correspondence matrices become invalid temporarily, we use a projection technique to guarantee the probabilistic constraint: If $P_{ij} < 0$ or $Q_{ij} < 0$, we project the entry to 0. If $P_{ij} > 1$ or $Q_{ij} > 1$, we project it to 1. The update steps of the alternating, projected gradient descent approach (APGD) are given by the following theorem.

Theorem 2.

[Update Step] The update steps for the user- ($P$) and group-level ($Q$) correspondence matrices of APGD are given by:

$$P^{(k+1)} = P^{(k)} - \eta_P \cdot \big( 2\,(P^{(k)} A Q^{(k)} - B)\, Q^{T(k)} A^T + \lambda \mathbf{1}\mathbf{1}^T \big)$$

$$Q^{(k+1)} = Q^{(k)} - \eta_Q \cdot \big( 2\,A^T P^{T(k+1)} (P^{(k+1)} A Q^{(k)} - B) + \mu \mathbf{1}\mathbf{1}^T \big),$$

where $P^{(k)}$, $Q^{(k)}$ are the correspondence matrices at iteration $k$, $\eta_P$ and $\eta_Q$ are the steps of the two phases of the APGD, and $\mathbf{1}$ is the all-1 column-vector.


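Proof. The update steps for gradient descent are:

$$P^{(k+1)} = P^{(k)} - \eta_P \cdot \frac{\partial f_{aug}(P,Q)}{\partial P} \tag{8.3}$$

$$Q^{(k+1)} = Q^{(k)} - \eta_Q \cdot \frac{\partial f_{aug}(P,Q)}{\partial Q} \tag{8.4}$$

Here $f(P,Q) = \|PAQ\|_F^2 - 2\,\mathrm{Tr}(PAQB^T)$ denotes the alignment part of the cost in Equation 8.1, and $s(P,Q) = \mathbf{1}^T P \mathbf{1} + \mathbf{1}^T Q \mathbf{1}$ the sparsity part, whose two terms are weighted by $\lambda$ and $\mu$, respectively.

First, we compute the derivative of $f$ with respect to $P$ by using properties of matrix derivatives:

$$
\begin{aligned}
\frac{\partial f(P,Q)}{\partial P} &= \frac{\partial \big(\|PAQ\|_F^2 - 2\,\mathrm{Tr}(PAQB^T)\big)}{\partial P} \\
&= \frac{\partial\, \mathrm{Tr}(PAQQ^TA^TP^T)}{\partial P} - 2\,\frac{\partial\, \mathrm{Tr}(PAQB^T)}{\partial P} \\
&= 2\,(PAQ - B)\,Q^TA^T
\end{aligned}
\tag{8.5}
$$

Then, by using properties of matrix derivatives and the invariant property of the trace under cyclic permutations, $\mathrm{Tr}(PAQQ^TA^TP^T) = \mathrm{Tr}(A^TP^TPAQQ^T)$, we obtain the derivative of $f(P,Q)$ with respect to $Q$:

$$
\begin{aligned}
\frac{\partial f(P,Q)}{\partial Q} &= \frac{\partial \big(\|PAQ\|_F^2 - 2\,\mathrm{Tr}(PAQB^T)\big)}{\partial Q} \\
&= \frac{\partial \big(\mathrm{Tr}(PAQQ^TA^TP^T) - 2\,\mathrm{Tr}(PAQB^T)\big)}{\partial Q} \\
&= \frac{\partial\, \mathrm{Tr}(A^TP^TPAQQ^T)}{\partial Q} - 2\,\frac{\partial\, \mathrm{Tr}(PAQB^T)}{\partial Q} \\
&= \big(A^TP^TPA + (A^TP^TPA)^T\big)\,Q - 2\,(PA)^T(B^T)^T \\
&= 2\,A^TP^T(PAQ - B)
\end{aligned}
\tag{8.6}
$$

Finally, the partial derivatives of $s(P,Q)$ with respect to $P$ and $Q$ are

$$\frac{\partial s(P,Q)}{\partial P} = \frac{\partial (\mathbf{1}^TP\mathbf{1} + \mathbf{1}^TQ\mathbf{1})}{\partial P} = \mathbf{1}\mathbf{1}^T \tag{8.7}$$

$$\frac{\partial s(P,Q)}{\partial Q} = \frac{\partial (\mathbf{1}^TP\mathbf{1} + \mathbf{1}^TQ\mathbf{1})}{\partial Q} = \mathbf{1}\mathbf{1}^T \tag{8.8}$$

By substituting Equations 8.5 and 8.7 (the latter weighted by $\lambda$) in Equation 8.3, we obtain the update step for $P$. Similarly, by substituting Equations 8.6 and 8.8 (the latter weighted by $\mu$) in Equation 8.4, we get the update step for $Q$. ■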
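To make the two update steps of Theorem 2 and the projection concrete, the following is a minimal sketch of one APGD iteration in plain Python. The function names, the list-of-lists matrix representation, and the toy inputs and step sizes are assumptions made for this illustration; they are not taken from the original Matlab implementation.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def residual(P, A, Q, B):
    """PAQ - B."""
    PAQ = matmul(matmul(P, A), Q)
    return [[m - b for m, b in zip(mr, br)] for mr, br in zip(PAQ, B)]


def f_aug(P, A, Q, B, lam, mu):
    """Augmented cost: ||PAQ - B||_F^2 + lam * sum(P) + mu * sum(Q)."""
    fro = sum(v * v for row in residual(P, A, Q, B) for v in row)
    return fro + lam * sum(map(sum, P)) + mu * sum(map(sum, Q))


def project(X):
    """Clip every entry to [0, 1] (the probabilistic constraint)."""
    return [[min(1.0, max(0.0, v)) for v in row] for row in X]


def apgd_iteration(P, Q, A, B, lam, mu, eta_p, eta_q):
    # Phase 1: Q fixed, gradient w.r.t. P is 2 (PAQ - B) Q^T A^T + lam * 11^T.
    G = matmul(matmul(residual(P, A, Q, B), transpose(Q)), transpose(A))
    P = project([[p - eta_p * (2 * g + lam) for p, g in zip(pr, gr)]
                 for pr, gr in zip(P, G)])
    # Phase 2: updated P fixed, gradient w.r.t. Q is 2 A^T P^T (PAQ - B) + mu * 11^T.
    G = matmul(matmul(transpose(A), transpose(P)), residual(P, A, Q, B))
    Q = project([[q - eta_q * (2 * g + mu) for q, g in zip(qr, gr)]
                 for qr, gr in zip(Q, G)])
    return P, Q


# Toy example: align a 3-user x 2-group graph with itself, uniform start.
A = [[1, 0], [1, 1], [0, 1]]
B = A
P0 = [[1 / 3] * 3 for _ in range(3)]
Q0 = [[0.5] * 2 for _ in range(2)]
P1, Q1 = apgd_iteration(P0, Q0, A, B, 0.1, 0.1, 0.05, 0.05)
print(f_aug(P0, A, Q0, B, 0.1, 0.1), f_aug(P1, A, Q1, B, 0.1, 0.1))
```

With a sufficiently small constant step, as in this toy run, one iteration lowers the augmented cost while keeping every entry of both matrices inside [0, 1].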


We note that the assumption in the above formulas is that $A$ and $B$ are rectangular adjacency matrices of bipartite graphs. It turns out that this formulation has a nice connection to the standard formulation for unipartite graph matching if we treat the input bipartite graphs as unipartite (i.e., symmetric, square adjacency matrix). We summarize this equivalence in the following proposition.

Proposition 1.

[Equivalence to Unipartite Graph Alignment] If the rectangular adjacency matrices of the bipartite graphs are converted to square matrices, then the minimization is done with respect to the coupled matrix $P^*$:

$$P^* = \begin{pmatrix} P & 0 \\ 0 & Q^T \end{pmatrix}.$$

That is, Problem 10 becomes $\min_{P^*} \|P^* A P^{*T} - B\|_F^2$, which is equivalent to the unipartite graph problem introduced at the beginning of Section 8.1.

8.2.2  Optimizations 

Up  to  this  point,  we  have  the  mathematical  foundation  at  our  disposal  to  build  our  algorithm, 
BiG-Align.  But  first  we  have  to  make  three  design  decisions: 

(D1) How to initialize the correspondence matrices?

(D2)  How  to  choose  the  steps  for  the  APGD? 

(D3)  How  to  handle  structurally  equivalent  nodes? 

The baseline approach, which we will refer to as BiG-Align-Basic, consists of the simplest answers to these questions: (D1) uniform initialization of the correspondence matrices, (D2) "small", constant step for the gradient descent, (D3) no specific manipulation of the structurally equivalent nodes. Next, we elaborate on sophisticated choices for the initialization and optimization step that render our algorithm more efficient. We also introduce the "node-multiplicity" problem, i.e., the problem of structurally equivalent nodes, and propose a way to deal with it.

(D1) How to initialize the correspondence matrices?

The optimization problem is non-convex (not even bi-convex), and the gradient descent gets stuck in local minima, depending heavily on the initialization. There are several different ways of initializing the correspondence matrices $P$ and $Q$, such as random, degree-based, and eigenvalue-based [Ume88, DLJ08]. While each of these initializations has its own rationale, they are designed for unipartite graphs and hence ignore the skewness of real, large-scale bipartite graphs. To address this issue, we propose a network-inspired approach (Net-Init), which is based on the following observation about large-scale, real bipartite graphs:

Observation  21.  Large,  real  networks  have  skewed  or  power-law-like  degree  distribution  [  , 

,  FF9  ] .  Specifically  in  bipartite  graphs,  usually  one  of  the  node  sets  is  significantly 
smaller  than  the  other,  and  has  skewed  degree  distribution. 

The implicit assumption² of Net-Init is that a person is almost equally popular in different social networks, or, more generally, an entity has similar "behavior" across the input graphs. In our work, we have found that such behavior can be well captured by the node degree. However, the technique we describe below can be naturally applied to other features (e.g., weight, ranking, clustering coefficient) that may capture the node behavior better.

² If the assumption does not hold, no method is guaranteed to find the alignment based purely on the structure of the graphs, but they can still reveal similarities between nodes.

Figure 8.1: (a) Choice of k in Step 1 of Net-Init (scree-like plot). (b) Pictorial initialization of the node/user-level correspondence matrix P by Net-Init.

Our initialization approach consists of four steps. For the description of our approach, we refer to the example of the LinkedIn and Facebook bipartite graphs, where the first set consists of users, and the second set of groups. For the sake of the example, we assume that the set of groups is significantly smaller than the set of users.

Step 1. Match 1-by-1 the top-k high-degree groups in the LinkedIn and Facebook graphs. To find k, we borrow the idea of the scree plot, which is used in Principal Component Analysis (PCA): we sort the unique degrees of each graph in descending order, and create the plot of unique degree vs. rank of node (Figure 8.1(a)). In this plot, we detect the "knee" and up to the corresponding degree we "safely" match the groups of the two graphs one-by-one, i.e., the most popular group of LinkedIn is aligned initially with the most popular group of Facebook, etc. For the automatic detection of the knee, we consider the plot piecewise, and assume that the knee occurs when the slope of a line segment is less than 5% of the slope of the previous segment.


Step  2.  For  each  of  the  matched  groups,  we  propose  to  align  their  neighbors  based  on  their 
Relative  Degree  Difference  (RDD): 


Definition 7. [RDD] The Relative Degree Distance function that aligns node $i$ of graph $A$ to node $j$ of $B$ is:

$$\mathrm{rdd}(i,j) = \left( 1 + \frac{|\deg(i) - \deg(j)|}{(\deg(i) + \deg(j))/2} \right)^{-1} \tag{8.9}$$

where $\deg(\cdot)$ is the degree of the corresponding node.


The idea behind this approach is that a node in one graph more probably corresponds to a node with similar degree in another graph, than to a node with very different degree. The above function assigns higher probabilities to alignments of similar nodes, and lower probabilities to alignments of very dissimilar nodes with respect to their degrees.

We  note  that  the  RDD  function,  rdd(i,j),  corresponds  to  the  degree-based  similarity  between 
node  i  and  node  j .  However,  it  can  be  generalized  to  other  properties  that  the  nodes  are  expected 
to  share  across  different  graphs.  Equation  8.9  captures  one  additional  desired  property:  it  penalizes 
the  alignments  based  on  the  relative  difference  of  the  degrees.  For  example,  two  nodes  of  degrees 
1  and  20,  respectively,  are  less  similar  than  two  nodes  with  degrees  1001  and  1020. 

Step 3. Create $c_g$ clusters of the remaining groups in both networks, based on their degrees. Align the clusters 1-by-1 according to the degrees (e.g., "high", "low"), and initialize the correspondences within the matched clusters using the RDD.


Step 4. Create $c_u$ clusters of the remaining users in both networks, based on their degrees. Align the users using the RDD approach within the corresponding user clusters.
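The two building blocks of Net-Init that are fully specified above, the knee detection of Step 1 and the RDD score of Step 2, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the curve is taken over consecutive unique degrees (one rank per unique degree), the RDD is read as the inverse of one plus the relative degree difference, and the function names and toy degree lists are hypothetical.

```python
def find_knee_degree(degrees, ratio=0.05):
    """Degree at the knee of the unique-degree vs. rank curve, or None.

    The knee is the first point where the drop of a segment is less than
    `ratio` (5% by default) of the drop of the previous segment.
    """
    uniq = sorted(set(degrees), reverse=True)
    drops = [hi - lo for hi, lo in zip(uniq, uniq[1:])]  # slope magnitudes
    for i in range(1, len(drops)):
        if drops[i] < ratio * drops[i - 1]:
            return uniq[i]
    return None


def top_k_groups(degrees, ratio=0.05):
    """Indices of the groups matched 1-by-1, most popular first."""
    knee = find_knee_degree(degrees, ratio)
    if knee is None:
        return []
    order = sorted(range(len(degrees)), key=lambda g: degrees[g], reverse=True)
    return [g for g in order if degrees[g] >= knee]


def rdd(deg_i, deg_j):
    """Degree-based similarity in (0, 1]; equals 1 for identical degrees."""
    mean_deg = (deg_i + deg_j) / 2
    if mean_deg == 0:
        return 1.0  # assumption: two isolated nodes count as identical
    return 1.0 / (1.0 + abs(deg_i - deg_j) / mean_deg)


# Hypothetical group degrees in the two networks.
linkedin = [100, 1000, 98, 400, 3, 97, 2]
facebook = [90, 350, 2, 1, 95, 900, 88]
pairs = list(zip(top_k_groups(linkedin), top_k_groups(facebook)))
print(pairs)                                  # initial 1-by-1 group matches
print(rdd(1, 20), rdd(1001, 1020))            # roughly 0.36 vs. 0.98
```

The last line reproduces the example from the text: the same absolute difference of 19 is penalized far more for degrees 1 and 20 than for degrees 1001 and 1020.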


(D2)  How  to  choose  the  steps  for  the  APGD  method? 

One of the most important parameters that come up in the APGD method is $\eta$ (the step of approaching the minimum point), which determines its convergence rate. In an attempt to automatically determine the step, we use the line search approach [BV04], which is described in Algorithm 8.9. Line search is a strategy that finds the local optimum for the step. Specifically, in the first phase of APGD, line search determines $\eta_P$ by treating the objective function, $f_{aug}$, as a function of $\eta_P$ (instead of a function of $P$ or $Q$) and loosely minimizing it. In the second phase of APGD, $\eta_Q$ is determined similarly. Next we introduce 3 variants of our method that differ in the way the steps are computed.

Variant 1: BiG-Align-Points. Our first approach consists of approximately minimizing the augmented cost function: we randomly pick some values for $\eta_P$ within some "reasonable" range, and compute the value of the cost function. We choose the step $\eta_P$ that corresponds to the minimum value of the cost function. We define $\eta_Q$ similarly. This approach is computationally expensive, as we shall see in Section 8.4.

Variant 2: BiG-Align-Exact. By carefully handling the objective function of our optimization problem, we can find the closed (exact) forms for $\eta_P$ and $\eta_Q$, which are given in the next theorem.

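Theorem 3.

[Optimal Step Size for P] In the first phase of APGD, the value of the step $\eta_P$ that exactly minimizes the augmented function, $f_{aug}(\eta_P)$, is given by:

$$\eta_P = \frac{2\,\mathrm{Tr}\{(P^{(k)}AQ)(\Delta_P AQ)^T - (\Delta_P AQ)B^T\} + \lambda \sum_{i,j} \Delta_{P_{ij}}}{2\,\|\Delta_P AQ\|_F^2} \tag{8.10}$$

where $P^{(k+1)} = P^{(k)} - \eta_P \Delta_P$, $\Delta_P = \nabla_P f_{aug}|_{P=P^{(k)}}$ and $Q = Q^{(k)}$.

Proof. To find the step $\eta_P$ that minimizes $f_{aug}(\eta_P)$, we take its derivative and set it to 0:

$$\frac{d f_{aug}}{d \eta_P} = \frac{d\big(\mathrm{Tr}\{P^{(k+1)}AQ(P^{(k+1)}AQ)^T - 2P^{(k+1)}AQB^T\} + \lambda \sum_{i,j} P^{(k+1)}_{ij}\big)}{d \eta_P} = 0, \tag{8.11}$$

where $P^{(k+1)} = P^{(k)} - \eta_P \Delta_P$, and $\Delta_P = \nabla_P f_{aug}|_{P=P^{(k)}}$. It also holds that

$$
\begin{aligned}
\mathrm{Tr}\big(P^{(k+1)}AQ(P^{(k+1)}AQ)^T - 2P^{(k+1)}AQB^T\big) ={}& \|P^{(k)}AQ\|_F^2 - 2\,\mathrm{Tr}(P^{(k)}AQB^T) + \eta_P^2\,\|\Delta_P AQ\|_F^2 \\
& + 2\eta_P\,\mathrm{Tr}(\Delta_P AQB^T) - 2\eta_P\,\mathrm{Tr}\big((P^{(k)}AQ)(\Delta_P AQ)^T\big).
\end{aligned}
\tag{8.12}
$$

Substituting Equation 8.12 in Equation 8.11, and solving for $\eta_P$ yields the 'best value' of $\eta_P$ as defined by the line search method. ■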
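As a sanity check of the closed form for the step of the first phase, the sketch below computes it for a small example and compares the cost at that step with the cost at slightly perturbed steps. It is a plain-Python illustration, not the original Matlab code: the helper names, the list-of-lists matrices, and the toy inputs are assumptions, and the formula used is twice the Frobenius inner product of (PAQ - B) with (Delta_P A Q), plus lambda times the sum of the entries of Delta_P, divided by twice the squared Frobenius norm of Delta_P A Q. The projection step is deliberately ignored here.

```python
def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def inner(X, Y):
    """Frobenius inner product Tr(X Y^T)."""
    return sum(a * b for xr, yr in zip(X, Y) for a, b in zip(xr, yr))


def f_aug(P, A, Q, B, lam, mu):
    PAQ = matmul(matmul(P, A), Q)
    R = [[m - b for m, b in zip(mr, br)] for mr, br in zip(PAQ, B)]
    return inner(R, R) + lam * sum(map(sum, P)) + mu * sum(map(sum, Q))


def grad_p(P, A, Q, B, lam):
    """Delta_P = 2 (PAQ - B) Q^T A^T + lam * 11^T."""
    PAQ = matmul(matmul(P, A), Q)
    R = [[m - b for m, b in zip(mr, br)] for mr, br in zip(PAQ, B)]
    G = matmul(matmul(R, transpose(Q)), transpose(A))
    return [[2 * g + lam for g in row] for row in G]


def exact_step_p(P, A, Q, B, lam):
    """Closed-form step along -Delta_P (undefined if Delta_P A Q is zero)."""
    delta = grad_p(P, A, Q, B, lam)
    M = matmul(matmul(P, A), Q)          # P^(k) A Q
    D = matmul(matmul(delta, A), Q)      # Delta_P A Q
    num = 2 * (inner(M, D) - inner(D, B)) + lam * sum(map(sum, delta))
    return num / (2 * inner(D, D)), delta


def line_cost(P, delta, eta, A, Q, B, lam, mu):
    """Augmented cost after moving P by -eta * delta."""
    moved = [[p - eta * d for p, d in zip(pr, dr)] for pr, dr in zip(P, delta)]
    return f_aug(moved, A, Q, B, lam, mu)


A = [[1, 0], [1, 1], [0, 1]]
B = A
P = [[1 / 3] * 3 for _ in range(3)]
Q = [[0.5] * 2 for _ in range(2)]
eta, delta = exact_step_p(P, A, Q, B, 0.1)
print(eta, line_cost(P, delta, eta, A, Q, B, 0.1, 0.1))
```

Because the cost is a quadratic function of the step along a fixed direction, the returned value is the exact minimizer on that line, which is what makes the closed form cheaper than sampling candidate steps.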

Similarly, we find the appropriate value for the step $\eta_Q$ of the second phase of APGD.

Theorem 4.

[Optimal Step Size for Q] In the second phase of APGD, the value of the step $\eta_Q$ that exactly minimizes the augmented function, $f_{aug}(\eta_Q)$, is given by:

$$\eta_Q = \frac{2\,\mathrm{Tr}\{(PAQ^{(k)})(PA\Delta_Q)^T - (PA\Delta_Q)B^T\} + \mu \sum_{i,j} \Delta_{Q_{ij}}}{2\,\|PA\Delta_Q\|_F^2} \tag{8.13}$$

where $\Delta_Q = \nabla_Q f_{aug}|_{Q=Q^{(k)}}$, $P = P^{(k+1)}$, and $Q^{(k+1)} = Q^{(k)} - \eta_Q \Delta_Q$.

Proof. The computation of $\eta_Q$ is symmetric to the computation of $\eta_P$ (Theorem 3), and thus we omit it. ■

BiG-Align-Exact is significantly faster than BiG-Align-Points. It turns out that we can increase the efficiency even more, as experimentation with real data revealed that the values of the gradient descent steps that minimize the objective function do not change drastically in every iteration (Figure 8.2). This led to the third variation of our algorithm:

Variant 3: BiG-Align-Skip. This variation applies exact line search for the first few (e.g., 100) iterations, and then updates the values of the steps every few (e.g., 500) iterations. This significantly reduces the computations for determining the optimal step sizes.
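The skipping rule can be expressed as a small predicate that decides, per iteration, whether the exact line search is rerun or the previous step sizes are reused. The sketch below is only one possible reading of "every few iterations": the function name, the 1-based iteration counter, and the choice to count the period from the end of the warm-up are assumptions, with the example values 100 and 500 taken from the text.

```python
def should_recompute_step(k, warmup=100, period=500):
    """True if exact line search should be run at iteration k (1-based).

    Line search runs at every iteration during the warm-up, and afterwards
    only once every `period` iterations; in between, the last step sizes
    are reused.
    """
    return k <= warmup or (k - warmup) % period == 0


# Number of line searches performed in 2100 iterations under this schedule.
searches = sum(should_recompute_step(k) for k in range(1, 2101))
print(searches)
```

Under this schedule, 2100 iterations trigger 104 line searches instead of 2100, which is where the saving over the exact variant comes from.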

(D3)  How  to  handle  structurally  equivalent  nodes? 

One  last  observation  that  renders  BiG-Align  more  efficient  is  the  following: 

Observation  22.  In  the  majority  of  graphs,  there  is  a  significant  number  of  nodes  that  cannot 
be  distinguished,  because  they  have  exactly  the  same  structural  features. 

For instance, in many real-world networks, a commonplace structure is stars [ ], but it is impossible to tell the peripheral nodes apart. Other examples of non-distinguishable nodes include the members of cliques, and full bipartite cores (Chapter 3, [ ]).

Figure 8.2: Hint for speedup: Size of optimal step for P (blue) and Q (green) vs. the number of iterations, for bipartite graphs with (a) 50 nodes, (b) 300 nodes, and (c) 900 nodes. We observe that the optimal step sizes do not change dramatically in consecutive iterations, and, thus, skipping some computations almost does not affect the accuracy at all.


To address this problem, we introduce a pre-processing phase in which we eliminate nodes with identical structures by aggregating them into super-nodes. For example, a star with 100 peripheral nodes, each connected to the center by an edge of weight 1, will be replaced by a super-node connected to the central node of the star by an edge of weight 100. This subtle step not only leads to a better optimization solution, but also improves the efficiency by reducing the scale of the graphs that are actually fed into BiG-Align.
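A minimal sketch of this pre-processing step for one node set of a bipartite graph is given below. It assumes that structural equivalence can be tested by comparing adjacency rows exactly and that the edge weights of the merged nodes are summed, as in the star example; the function name and the returned membership lists are hypothetical conveniences for mapping the alignment back to the original nodes.

```python
from collections import defaultdict


def aggregate_identical_rows(A):
    """Merge rows of A that are exactly equal into weighted super-nodes.

    Returns the reduced matrix and, for each super-node, the list of the
    original row indices it stands for.
    """
    groups = defaultdict(list)
    for idx, row in enumerate(A):
        groups[tuple(row)].append(idx)
    merged, members = [], []
    for row, idxs in groups.items():
        # Summing identical rows is the same as scaling by the multiplicity.
        merged.append([w * len(idxs) for w in row])
        members.append(idxs)
    return merged, members


# A star: 100 peripheral users, each linked to the single center with weight 1.
star = [[1] for _ in range(100)]
reduced, members = aggregate_identical_rows(star)
print(reduced, len(members[0]))
```

Keeping the membership lists makes it possible to assign every original node of a super-node to the same match after the alignment has been computed on the reduced graph.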


8.2.3  BiG-Align:  Putting  everything  together 

The  previous  subsections  shape  up  the  proposed  algorithm,  BiG-Align,  the  pseudocode  of  which 
is  given  in  Algorithms  8.8  and  8.9. 

In our implementation, the only parameter that the user is required to input is the sparsity penalty, $\lambda$. The bigger this parameter is, the more entries of the matrices are forced to be 0. We set the other sparsity penalty $\mu = \lambda \cdot \frac{\text{elements in } P}{\text{elements in } Q}$, so that the penalty per non-zero element of $P$ and $Q$ is the same.

It is worth mentioning that, in contrast to the approaches found in the literature, our method does not use the classic Hungarian algorithm to find the hard correspondences between the nodes of the bipartite graphs. Instead, we rely on a fast approximation: we align each row $i$ (node/user) of $P^T$ with the column $j$ (node/user) that has the maximum probability. It is clear that this assignment is very fast, and even parallelizable, as each node alignment can be processed independently. Moreover, it allows aligning multiple nodes of one graph with the same node of the other graph, a property that is desirable especially in the case of structurally equivalent nodes.
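The row-wise maximum rule can be written in a couple of lines. The sketch below is an illustration in plain Python with a hypothetical function name; ties are broken by taking the first maximum, which is an assumption the text does not specify.

```python
def hard_alignment(P):
    """For each row i of P^T, return the column j with the largest probability."""
    PT = [list(col) for col in zip(*P)]
    return [max(range(len(row)), key=row.__getitem__) for row in PT]


# Toy soft correspondence matrix.
P = [[0.90, 0.1, 0.2],
     [0.05, 0.7, 0.6],
     [0.05, 0.2, 0.2]]
print(hard_alignment(P))
```

In this toy matrix two rows of $P^T$ pick the same column, which shows the many-to-one behavior mentioned above; a Hungarian-style assignment would forbid it.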

Figure 8.3 depicts how the cost and accuracy of the alignment change with respect to the number of iterations of the gradient descent algorithm.

Algorithm 8.8 BiG-Align-Exact: Bipartite Graph Alignment

Input: $A$, $B$, $\lambda$, MAXITER, $\epsilon = 10^{-6}$; cost(0) = 0; $k = 1$
Output: The correspondence matrices $P$ and $Q$

/* STEP 1: pre-processing for node-multiplicity */
aggregate identical nodes
/* STEP 2: initialization */
[$P_0$, $Q_0$] = Net-Init
cost(1) = $f_{aug}(P_0, Q_0)$
/* STEP 3: alternating projected gradient descent (APGD) */
while |cost(k-1) - cost(k)| / cost(k-1) > $\epsilon$ and $k$ < MAXITER do
    k++
    /* Phase 1: fixed Q, minimization with respect to P */
    $\eta_{P_k}$ = LineSearch_P($P^{(k)}$, $Q^{(k)}$, $\nabla_P f_{aug}|_{P=P^{(k)}}$)
    $P^{(k+1)} = P^{(k)} - \eta_{P_k} \nabla_P f_{aug}(P^{(k)}, Q^{(k)})$
    validProjection($P^{(k+1)}$)
    /* Phase 2: fixed P, minimization with respect to Q */
    $\eta_{Q_k}$ = LineSearch_Q($P^{(k+1)}$, $Q^{(k)}$, $\nabla_Q f_{aug}|_{Q=Q^{(k)}}$)
    $Q^{(k+1)} = Q^{(k)} - \eta_{Q_k} \nabla_Q f_{aug}(P^{(k+1)}, Q^{(k)})$
    validProjection($Q^{(k+1)}$)
    cost(k) = $f_{aug}(P, Q)$
end while
return $P^{(k+1)}$, $Q^{(k+1)}$

/* Projection Step */
function validProjection($P$)
    for all $i, j$
        if $P_{ij} < 0$ then $P_{ij} = 0$
        else if $P_{ij} > 1$ then $P_{ij} = 1$
end function


8.3  Uni-Align:  Extension  to  Unipartite  Graphs 

Although  our  primary  target  for  BiG-Align  is  bipartite  graphs  (which  by  themselves  already  stand 
for  a  significant  portion  of  real  graphs),  as  a  side-product,  BiG-Align  also  offers  an  alternative, 
fast  solution  to  the  alignment  problem  of  unipartite  graphs.  Our  approach  consists  of  two  steps: 
Step 1: Uni- to Bi-partite Graph Conversion. The first step involves converting the $n \times n$ unipartite graphs to bipartite graphs. Specifically, we can first extract $d$ node features, such as degree, edges in a node's egonet (= induced subgraph of the node and its neighbors), and clustering coefficient. Then, we can form the $n \times d$ bipartite graph node-to-feature, where $n \gg d$. The runtime of this step depends on the time complexity of extracting the selected features.

Algorithm 8.9 Line Search for $\eta_P$ and $\eta_Q$

function LineSearch_P($P$, $Q$, $\Delta_P$)
    return $\eta_P = \dfrac{2\,\mathrm{Tr}\{(P^{(k)}AQ)(\Delta_P AQ)^T - (\Delta_P AQ)B^T\} + \lambda \sum_{i,j} \Delta_{P_{ij}}}{2\,\|\Delta_P AQ\|_F^2}$
end function

function LineSearch_Q($P$, $Q$, $\Delta_Q$)
    return $\eta_Q = \dfrac{2\,\mathrm{Tr}\{(PAQ^{(k)})(PA\Delta_Q)^T - (PA\Delta_Q)B^T\} + \mu \sum_{i,j} \Delta_{Q_{ij}}}{2\,\|PA\Delta_Q\|_F^2}$
end function

Figure 8.3: BiG-Align (900 nodes, $\lambda = 0.1$): As desired, the cost of the objective function drops with the number of iterations (panel (a): cost function vs. iterations), while the accuracy both on node- (green) and community-level (red) increases. The exact definition of accuracy is given in Section 8.4.2.

Step 2: Finding P. We note that in this case, the alignment of the feature sets of the bipartite graphs is known, i.e., $Q$ is an identity matrix, since we extract the same type of features from the graphs. Thus, we only need to align the $n$ nodes, i.e., compute $P$. We revisit Equation 8.1 of our initial minimization problem, and now we want to minimize it only with respect to $P$. By setting the derivative of $f_{aug}$ with respect to $P$ equal to 0, we have:

$$P \cdot (AA^T) = BA^T - \lambda/2 \cdot \mathbf{1}\mathbf{1}^T,$$

where $A$ is an $n \times d$ matrix. If we do SVD (Singular Value Decomposition) on this matrix, i.e., $A = USV^T$, the Moore-Penrose pseudo-inverse of $AA^T$ is $(AA^T)^{\dagger} = US^{-2}U^T$. Therefore, we have

$$
\begin{aligned}
P &= (BA^T - \lambda/2 \cdot \mathbf{1}\mathbf{1}^T)(AA^T)^{\dagger} \\
&= (BA^T - \lambda/2 \cdot \mathbf{1}\mathbf{1}^T)(US^{-2}U^T) \\
&= B \cdot (A^TUS^{-2}U^T) - \mathbf{1} \cdot (\lambda/2 \cdot \mathbf{1}^TUS^{-2}U^T) \\
&= B \cdot X - \mathbf{1} \cdot Y
\end{aligned}
\tag{8.14}
$$

where $X = A^TUS^{-2}U^T$ and $Y = \lambda/2 \cdot \mathbf{1}^TUS^{-2}U^T$. Hence, we can exactly (non-iteratively) find $P$ from Equation 8.14. It can be shown that the time complexity for finding $P$ is $O(nd^2)$ (after omitting the simpler terms), which is linear in the number of nodes of the input graphs.

What is more, we can see from Equation 8.14 that $P$ itself has a low-rank structure. In other words, we do not need to store $P$ in the form of $n \times n$. Instead, we can represent (compress) $P$ as the multiplication of two low-rank matrices $X$ and $Y$, whose additional space cost is just $O(nd + n) = O(nd)$.
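The closed form of Equation 8.14 translates almost line by line into code. The sketch below assumes NumPy is available and uses hypothetical function names; it keeps the two factors X (of size d x n) and Y (of size 1 x n) and only materializes the dense n x n matrix for the purpose of checking the toy example. Dropping numerically zero singular values before inverting is an implementation choice of this sketch, not something stated in the text.

```python
import numpy as np


def uni_align_factors(A, lam):
    """Return X = A^T U S^-2 U^T and Y = (lam/2) 1^T U S^-2 U^T."""
    n = A.shape[0]
    U, s, _ = np.linalg.svd(A, full_matrices=False)   # thin SVD, A = U S V^T
    keep = s > 1e-10 * s.max()                        # drop zero singular values
    U, s = U[:, keep], s[keep]
    pinv_AAt = (U / s**2) @ U.T                       # U S^-2 U^T
    X = A.T @ pinv_AAt
    Y = (lam / 2.0) * np.ones((1, n)) @ pinv_AAt
    return X, Y


def dense_p(B, X, Y):
    """P = B X - 1 Y, materialized only for inspection."""
    return B @ X - np.ones((B.shape[0], 1)) @ Y


# Toy check: B is a row-permuted copy of a random 6 x 3 feature matrix.
rng = np.random.default_rng(0)
A = rng.random((6, 3))
B = A[rng.permutation(6)]
X, Y = uni_align_factors(A, lam=0.0)
P = dense_p(B, X, Y)
print(np.allclose(P @ A, B))
```

With the sparsity penalty set to zero and a full-column-rank feature matrix, the recovered P maps A onto B, which is a quick way to validate an implementation before turning the penalty on.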

8.4  Experimental  Evaluation 

In this section, we evaluate the proposed algorithms, BiG-Align and Uni-Align, with respect to alignment accuracy and runtime, and also compare them to the state-of-the-art methods. The code for all the methods is written in Matlab and the experiments were run on an Intel(R) Xeon(R) CPU 5160 @ 3.00GHz, with 16GB of RAM.

8.4.1  Baseline  Methods 

To the best of our knowledge, no graph matching algorithm has been designed for bipartite graphs. Throughout this section, we compare our algorithms to 3 state-of-the-art approaches, which are succinctly described in Table 8.2: (i) Umeyama, the influential eigenvalue decomposition-based approach proposed by Umeyama [Ume88]; (ii) NMF-based, a recent approach based on Non-negative Matrix Factorization [DLJ08]; and (iii) NetAlign-full and NetAlign-deg, two variations of a fast and scalable Belief Propagation-based (BP) approach [BGSW13]. Some details about these approaches are provided in the related work (Section 8.5).

In  order  to  apply  these  approaches  to  bipartite  graphs,  we  convert  the  latter  to  unipartite  by 
using  Proposition  1.  In  addition  to  that,  since  the  BP-based  approach  requires  not  only  the  two 
input  graphs,  but  also  a  bipartite  graph  that  encodes  the  possible  matchings  per  node,  we  use 
two  heuristics  to  form  the  required  bipartite  ‘matching’  graph:  (a)  full  bipartite  graph,  which 
essentially  conveys  that  we  have  no  domain  information  about  the  possible  alignments,  and  each 
node  of  the  first  graph  can  be  aligned  with  any  node  of  the  second  graph  (NetAlign-full); 
and  (b)  degree-based  bipartite  graph,  where  only  nodes  with  the  same  degree  in  both  graphs  are 
considered  possible  matchings  (NetAlign-deg). 

8.4.2  Evaluation  of  BiG-Align 

For the experiments on bipartite graphs, we use the movie-genre graph of the MovieLens network³. Each of the 1,027 movies is linked to at least one of the 23 genres (e.g., comedy, romance, drama). Specifically, from this network, we extract subgraphs of different sizes. Then, following the tradition in the literature [DLJ08], for each of the subgraphs we generate permutations, $B$, with noise from 0% to 20% using the formula $B_{ij} = (PAQ)_{ij} \cdot (1 + noise \cdot r_{ij})$, where $r_{ij}$ is a random number in [0, 1]. For each noise level and graph size, we generate 10 distinct permutations of the initial subnetwork. We run the alignment algorithms on all the pairs of the original and permuted subgraphs, and report the mean accuracy and runtime. For all the variants of BiG-Align, we set the sparsity penalty $\lambda = 0.1$.

³ http://www.movielens.org

Table 8.2: Graph Alignment Algorithms: name conventions, short description, type of graphs for which they were designed ('uni-' for unipartite, 'bi-' for bipartite graphs), and reference.

Name               Description                        Graph   Source
Umeyama            eigenvalue-based                   uni-    [Ume88]
NMF-based          NMF-based                          uni-    [DLJ08]
NetAlign-full      BP-based with uniform init.        uni-    Modified from [BGSW13]
NetAlign-deg       BP-based with same-degree init.    uni-    Modified from [BGSW13]
BiG-Align-Basic    APGD (no optimizations)            bi-     current
BiG-Align-Points   APGD + approx. Line Search         bi-     current
BiG-Align-Exact    APGD + exact Line Search           bi-     current
BiG-Align-Skip     APGD + skip some Line Search       bi-     current
Uni-Align          BiG-Align-inspired (SVD)           uni-    current
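The construction of the noisy permuted graphs can be sketched as follows. This is an illustration in plain Python, not the code used for the experiments: the function names, the seeded random generator, and the toy input are assumptions, and the noise term follows the formula in which every entry of PAQ is scaled by one plus the noise level times a uniform random number.

```python
import random


def matmul(X, Y):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def permutation_matrix(n, rng):
    """Random n x n permutation matrix."""
    perm = list(range(n))
    rng.shuffle(perm)
    return [[1.0 if perm[i] == j else 0.0 for j in range(n)] for i in range(n)]


def noisy_permutation(A, noise, seed=0):
    """Return (B, P, Q) with B_ij = (PAQ)_ij * (1 + noise * r_ij)."""
    rng = random.Random(seed)
    P = permutation_matrix(len(A), rng)        # shuffles the first node set
    Q = permutation_matrix(len(A[0]), rng)     # shuffles the second node set
    PAQ = matmul(matmul(P, A), Q)
    B = [[v * (1 + noise * rng.random()) for v in row] for row in PAQ]
    return B, P, Q


# Toy movie-genre matrix: 4 movies x 3 genres, 20% noise.
A = [[1, 0, 0], [1, 1, 0], [0, 1, 1], [0, 0, 1]]
B, P, Q = noisy_permutation(A, noise=0.2, seed=1)
print(B)
```

Because the noise is multiplicative, absent edges stay absent and existing edges are inflated by at most the chosen noise level, so the ground-truth matrices P and Q remain available for computing the accuracy.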

How  do  we  compute  the  accuracy  of  the  methods?  For  the  state-of-the-art  methods,  which 
find  “hard”  alignments  between  the  nodes,  the  accuracy  is  computed  as  usual:  only  if  the  true 
correspondence  is  found,  the  corresponding  matching  is  deemed  correct.  In  other  words,  we  use 
the  state-of-the-art  algorithms  off-the-shelf.  For  our  method,  BiG-Align,  which  has  the  advantage 
of  finding  “soft”,  probabilistic  alignments,  we  consider  two  cases  for  evaluating  its  accuracy: 
(i)  Correct  Alignment.  If  the  true  correspondence  coincides  with  the  most  probable  matching,  we 
count  the  node  alignment  as  correct;  (ii)  Partially  Correct  Alignment.  If  the  true  correspondence  is 
among  the  most  probable  matchings  (tie),  the  alignment  thereof  is  deemed  partially  correct  and 
weighted  by  (#  of  nodes  in  tie)/  (total  #  of  nodes). 


Accuracy. Figures 8.4(a) and (b) present the accuracy of the methods for two different graph sizes and varying level of noise in the permutations. We observe that BiG-Align outperforms all the other methods in most cases by a large margin. In Figure 8.4(b), the only exception is the case of 20% of noise in the 900-node graphs, where NetAlign-deg and NetAlign-full perform slightly better than our algorithm, BiG-Align-Exact. The results for other graph sizes are along the same lines, and therefore are omitted for space.

Figure 8.5(a) depicts the accuracy of the alignment methods for varying graph size. For graphs with different sizes, the variants of our method achieve significantly higher accuracy (70%-98%) than the baselines (10%-58%). Moreover, surprisingly, BiG-Align-Skip performs slightly better than BiG-Align-Exact, although the former skips several updates of the gradient descent steps. The only exception is for the smallest graph size, where the consecutive optimal steps change significantly (Figure 8.2(a)), and, thus, skipping computations affects the performance.

Figure 8.4: [Higher is better.] Accuracy of bipartite graph alignment vs. level of noise (0-20%), for (a) graphs of 50 nodes and (b) graphs of 900 nodes. BiG-Align-Exact (red line with square marker) almost always outperforms the baseline methods.

NetAlign-full and Umeyama's algorithm are the least accurate methods, while NMF-based and NetAlign-deg achieve medium accuracy. Finally, the accuracy vs. runtime plot in Figure 8.5(b) shows that our algorithms have two desired properties: they achieve better performance, faster than the baseline approaches.

Runtime. Figure 8.5(c) presents the runtime as a function of the number of edges in the graphs. Umeyama's algorithm and NetAlign-deg are the fastest methods, but at the cost of accuracy; BiG-Align is up to 10x more accurate in the cases that it performs slower. The third best method is BiG-Align-Skip, closely followed by BiG-Align-Exact. BiG-Align-Skip is up to 174x faster than the NMF-based approach, and up to 19x faster than NetAlign-full. However, our simplest method that uses line search, BiG-Align-Points, is the slowest approach and, given that it takes too long to terminate for graphs with more than 1.5K edges, we omit several data points in the plot.

We  note  that  BiG-Align  is  a  single  machine  implementation,  but  it  has  the  potential  for  further 
speed-up.  For  example,  it  could  be  parallelized  by  splitting  the  optimization  problem  to  smaller 
subproblems  (by  decomposing  the  matrices,  and  doing  simple  column-row  multiplications). 
Moreover,  instead  of  the  basic  gradient  descent  algorithm,  we  can  use  a  variant  method,  the 
stochastic  gradient  descent,  which  is  based  on  sampling. 

Variants of BiG-Align. Before we continue with the evaluation of Uni-Align, we present
in Table 8.3 the runtime and accuracy of all the variants of BiG-Align for aligning movie-genre
graphs with varying sizes and permutations with noise level 10%. The parameters used in this
experiment are $\epsilon = 10^{-5}$ and $\lambda = 0.1$. For BiG-Align-Basic, $\eta$ is constant
and equal to $10^{-4}$, while the correspondence matrices are initialized uniformly. This is not
the best setting for all the pairs of graphs that we are aligning, and it results in very low
accuracy. On the other hand, BiG-Align-Skip is not only ~350x faster than BiG-Align-Points, but
also more accurate. Moreover, it is ~2x faster than BiG-Align-Exact with higher or equal accuracy.
The speedup can be further increased by skipping more updates of the gradient descent steps.

[Figure 8.5: Accuracy and runtime of alignment of bipartite graphs. Panels: (a) "Bipartite Graphs:
Accuracy Comparison" (higher is better; accuracy of alignment vs. number of nodes); (b) "Bipartite
Graphs: Accuracy vs. Runtime"; (c) "Bipartite Graphs: Runtime Comparison". (a) BiG-Align-Exact and
BiG-Align-Skip (red lines) significantly outperform all the alignment methods for almost all the
graph sizes in terms of accuracy; (b) BiG-Align-Exact and BiG-Align-Skip (red squares/ovals) are
faster and more accurate than the baselines for both graph sizes; (c) the BiG-Align variants are
faster than all the baseline approaches, except for Umeyama's algorithm.]

Table 8.3: Runtime (top) and accuracy (bottom) comparison of the BiG-Align variants:
BiG-Align-Basic, BiG-Align-Points, BiG-Align-Exact, and BiG-Align-Skip. BiG-Align-Skip is not only
faster, but also comparably accurate to or more accurate than BiG-Align-Exact.

Runtime (sec)
Nodes   BiG-Align-Basic     BiG-Align-Points    BiG-Align-Exact     BiG-Align-Skip
        mean      std       mean      std       mean      std       mean      std
50      0.07      0.00      17.3      0.05      0.24      0.08      0.56      0.01
100     0.023     0.00      1245.7    394.55    5.6       2.93      3.9       0.05
200     31.01     16.58     2982.1    224.81    25.5      0.39      10.1      0.10
300     0.032     0.00      5240.9    30.89     42.1      1.61      20.1      1.62
400     0.027     0.01      7034.5    167.08    45.8      2.058     21.3      0.83
500     0.023     0.01      -         -         57.2      2.22      36.6      0.60
600     0.028     0.01      -         -         64.5      2.67      40.8      1.26
700     0.029     0.01      -         -         73.6      2.78      44.6      1.23
800     166.7     1.94      -         -         86.9      3.63      49.9      1.06
900     211.9     5.30      -         -         111.9     2.96      61.8      1.28

Accuracy
Nodes   BiG-Align-Basic     BiG-Align-Points    BiG-Align-Exact     BiG-Align-Skip
        mean      std       mean      std       mean      std       mean      std
50      0.071     0.00      0.982     0.02      0.988     0         0.904     0.03
100     0.034     0.00      0.922     0.07      0.939     0.06      0.922     0.07
200     0.722     0.37      0.794     0.01      0.973     0.01      0.975     0.00
300     0.014     0.00      0.839     0.02      0.972     0.01      0.964     0.01
400     0.011     0.00      0.662     0.02      0.916     0.03      0.954     0.01
500     0.011     0.00      -         -         0.66      0.20      0.697     0.24
600     0.005     0.00      -         -         0.67      0.20      0.713     0.23
700     0.004     0.00      -         -         0.69      0.20      0.728     0.19
800     0.013     0.00      -         -         0.12      0.02      0.165     0.03
900     0.015     0.00      -         -         0.17      0.20      0.195     0.22
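
The speedups quoted for BiG-Align-Skip can be checked against the mean runtimes reported in
Table 8.3. The short Python sketch below does this arithmetic; the dictionary layout and the helper
name `speedup` are illustrative choices and not part of the thesis code, and only the mean runtimes
(not the standard deviations) are used.

```python
# Mean runtimes (sec) copied from Table 8.3, keyed by number of nodes.
points = {50: 17.3, 100: 1245.7, 200: 2982.1, 300: 5240.9, 400: 7034.5}
exact = {50: 0.24, 100: 5.6, 200: 25.5, 300: 42.1, 400: 45.8,
         500: 57.2, 600: 64.5, 700: 73.6, 800: 86.9, 900: 111.9}
skip = {50: 0.56, 100: 3.9, 200: 10.1, 300: 20.1, 400: 21.3,
        500: 36.6, 600: 40.8, 700: 44.6, 800: 49.9, 900: 61.8}


def speedup(slow, fast):
    """Ratio slow/fast for every graph size that appears in both tables."""
    return {n: slow[n] / fast[n] for n in sorted(slow) if n in fast}


points_vs_skip = speedup(points, skip)   # only sizes where Points terminated
exact_vs_skip = speedup(exact, skip)

for n, ratio in points_vs_skip.items():
    print(f"{n:4d} nodes: Skip is {ratio:6.1f}x faster than Points")
for n, ratio in exact_vs_skip.items():
    print(f"{n:4d} nodes: Skip is {ratio:6.2f}x faster than Exact")
```

On these means, the Points-to-Skip ratio is about 31x at 50 nodes and roughly 260x to 330x for 100
to 400 nodes, and the Exact-to-Skip ratio is roughly 1.4x to 2.5x for 100 nodes and above (at 50
nodes Skip is the slower of the two), so the quoted factors are best read as approximate summaries.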

Overall, the results show that a naive solution of the optimization problem, such as
BiG-Align-Basic, is not sufficient, and the optimizations we propose in Section 8.2 are crucial and
render our algorithm efficient.


8.4.3  Evaluation  of  Uni-Align 

To evaluate our proposed method, Uni-Align, for aligning unipartite graphs, we use the Facebook
who-links-to-whom graph [VMCC ], which consists of approximately 64K nodes. In this case,
the baseline approaches are readily employed, while our method requires the conversion of the
given unipartite graph to bipartite. We do so by extracting some unweighted egonet^4 features
for each node (degree of node, degree of egonet^5, edges of egonet, mean degree of the node's
neighbors). As before, from the initial graph we extract subgraphs of size 100-800 nodes (or
equivalently, 264-6K edges), and create 10 noisy permutations per noise level.

^4 As a reminder, the egonet of a node is the induced subgraph of its neighbors.

^5 The degree of an egonet is defined as the number of incoming and outgoing edges of the subgraph,
when viewed as a super-node.

[Figure 8.6: Accuracy and runtime of alignment of unipartite graphs. Panels: (a) "Unipartite
Graphs: Accuracy vs. Runtime"; (b) "Unipartite Graphs: Runtime Comparison". (a) Uni-Align (red
points) is more accurate and faster than all the baselines for all graph sizes; (b) Uni-Align (red
squares) is faster than all the baseline approaches, followed closely by Umeyama's approach (green
circles).]
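
The conversion from a unipartite graph to a node-to-feature bipartite graph can be sketched in
plain Python as follows. This is a minimal illustration under stated assumptions, not the thesis
implementation: the graph is undirected and unweighted, it is stored as a dictionary of neighbor
sets, and the egonet is taken to contain the node itself together with its neighbors. The names
`egonet_features` and `node_feature_matrix` are hypothetical.

```python
def egonet_features(adj, node):
    """Return (degree, egonet_degree, egonet_edges, mean_neighbor_degree).

    adj maps each node to the set of its neighbors (undirected, unweighted).
    Assumption: the egonet consists of the node itself plus its neighbors.
    """
    neighbors = adj[node]
    ego = neighbors | {node}
    degree = len(neighbors)
    # Every internal edge is counted from both endpoints, hence the division by 2.
    egonet_edges = sum(len(adj[u] & ego) for u in ego) // 2
    # Edges with exactly one endpoint inside the egonet (egonet viewed as a super-node).
    egonet_degree = sum(len(adj[u] - ego) for u in ego)
    mean_neighbor_degree = (
        sum(len(adj[u]) for u in neighbors) / degree if degree else 0.0
    )
    return degree, egonet_degree, egonet_edges, mean_neighbor_degree


# Toy graph: a triangle 0-1-2 with a tail 2-3-4.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
features = {v: egonet_features(adj, v) for v in adj}

# Rows are nodes, columns are features: a weighted node-to-feature bipartite graph.
node_feature_matrix = [list(features[v]) for v in sorted(adj)]
```

Each row of `node_feature_matrix` plays the role of one node's links to the four feature nodes, so
the resulting matrix can be handed to a bipartite aligner in place of the original adjacency
matrix.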


Accuracy. The accuracy vs. runtime plot in Figure 8.6(a) shows that Uni-Align outperforms all
the other methods in terms of accuracy and runtime for all the graph sizes depicted. Although NMF
achieves a reasonably good accuracy for the graph of 200 nodes, it takes too long to terminate; we
stopped the runs for larger graphs as the execution was taking too long. The remaining
approaches are fast enough, but yield poor accuracy.

Runtime. Figure 8.6(b) compares the graph alignment algorithms with respect to their running
time (in logscale). Uni-Align is the fastest approach, closely followed by Umeyama's algorithm.
NetAlign-deg is some orders of magnitude slower than the previously mentioned methods.
NetAlign-full ran out of memory for graphs with more than 2.8K edges, and we stopped
the runs of the NMF-based approach, as it was taking too long to terminate even for small graphs
with 300 nodes and 1.5K edges. The results are similar for other graph sizes that, for simplicity, are
not shown in the figure. For graphs with 200 nodes and ~1.1K edges (which is the biggest graph
for which all the methods were able to terminate), Uni-Align is 1.75x faster than Umeyama's
approach; 2x faster than NetAlign-deg; 2,927x faster than NetAlign-full; and 31,709x
faster than the NMF-based approach.


8.5  Related  Work 

The graph alignment problem is of such great interest that there are more than 150 publications
proposing different solutions for it, spanning numerous research fields: from data mining
to security and re-identification [ ] [ *09], bioinformatics [ LO ] [ ] [ ],
databases [MGMR02], chemistry [SHL08], vision, and pattern recognition [CFSV04]. Among
the suggested approaches are genetic, spectral, and clustering algorithms [BYH04, QH06], decision
trees, expectation-maximization [ ], graph edit distance [ 9], simplex [ >], non-linear
optimization [ ], iterative HITS-inspired [ ][ ], and probabilistic [ ] methods. Some
methods that are more efficient for large graphs include a distributed, belief-propagation-based
method for protein alignment [ 0], and another message-passing algorithm for aligning sparse
networks when some possible matchings are given [ ]. We note that all these works are
designed for unipartite graphs, while we focus on bipartite graphs.

One of the well-known approaches is Umeyama's near-optimum solution for nearly-isomorphic
graphs [ ie88]. The graph matching or alignment problem is formulated as the optimization
problem

\[ \min_{P} \| P A P^T - B \|, \]

where $P$ is a permutation matrix. The method solves the problem based on the eigendecompositions
of the matrices. For symmetric $n \times n$ matrices $A$ and $B$, their eigendecompositions are
given by

\[ A = U_A \Lambda_A U_A^T, \qquad B = U_B \Lambda_B U_B^T, \]

where $U_A$ ($U_B$) is an orthonormal^6 matrix whose $i$-th column is the eigenvector $v_i$ of $A$
($B$), and $\Lambda_A$ ($\Lambda_B$) is the diagonal matrix with the corresponding eigenvalues.
When $A$ and $B$ are isomorphic, the optimum permutation matrix is obtained by applying the
Hungarian algorithm [ ] to the matrix $U_B U_A^T$. This solution is good only if the matrices are
isomorphic or nearly isomorphic. Umeyama's approach operates on unipartite, weighted graphs with
the same number of nodes. Follow-up works employ different constraints for matrix $P$. For example,
the constraint that $P$ is a doubly stochastic matrix is imposed in [ ] and [ ], where the
proposed formulation, PATH, is based on convex and concave relaxations.

^6 A matrix $R$ is orthonormal if it is a square matrix with real entries such that
$R^T R = R R^T = I$.
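
A minimal NumPy/SciPy sketch of this spectral procedure is given below. It follows Umeyama's
original formulation, which takes entrywise absolute values of the eigenvector matrices before the
assignment step because eigenvectors are only determined up to sign. The function name
`umeyama_match` is illustrative, and `scipy.optimize.linear_sum_assignment` is used as a stand-in
that solves the same assignment problem as the Hungarian algorithm.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def umeyama_match(A, B):
    """Estimate a permutation matrix P such that P A P^T is close to B.

    A and B are symmetric n x n weight matrices of two graphs on n nodes.
    """
    # eigh returns eigenvalues in ascending order, so columns of U_A and U_B line up.
    _, U_A = np.linalg.eigh(A)
    _, U_B = np.linalg.eigh(B)
    # Absolute values remove the sign ambiguity of the eigenvectors.
    M = np.abs(U_B) @ np.abs(U_A).T
    # Maximize the total similarity by minimizing its negation.
    rows, cols = linear_sum_assignment(-M)
    P = np.zeros_like(M)
    P[rows, cols] = 1.0
    return P


# Toy check: B is a relabelled copy of a random weighted graph A.
rng = np.random.default_rng(0)
n = 6
W = rng.random((n, n))
A = (W + W.T) / 2.0
perm = rng.permutation(n)
P_true = np.eye(n)[perm]
B = P_true @ A @ P_true.T

P_est = umeyama_match(A, B)
residual = np.linalg.norm(P_est @ A @ P_est.T - B)
```

On a noise-free pair with distinct eigenvalues, as in this toy check, `residual` should vanish up to
rounding error; for noisy or far-from-isomorphic inputs the result is only a heuristic starting
point, in line with the caveat above.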

Ding  et  al.  [DU08]  proposed  a  Non-Negative  Matrix  Factorization  (NMF)  approach,  which 
starts  from  Umeyama’s  solution,  and  then  applies  an  iterative  algorithm  to  find  the  orthogonal 
matrix  P  with  the  node  correspondences.  The  multiplicative  update  algorithm  for  weighted, 
undirected  graphs  is: 


\[ P_{ij} \leftarrow P_{ij} \sqrt{\frac{(APB)_{ij}}{(P\alpha)_{ij}}}, \qquad
   \alpha = \frac{P^T A P B + (P^T A P B)^T}{2}. \]


The algorithm stops when convergence is achieved. At that point, P often has entries that are
not in {0, 1}, so the authors propose applying the Hungarian algorithm for the bipartite graph
matching^7. The runtime complexity of the NMF-based algorithm is cubic in the number of nodes.
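
As a rough illustration of the NMF-based scheme, the sketch below transcribes the multiplicative
update into NumPy and rounds the result to a permutation. It assumes the square-root form of the
update, P_ij <- P_ij * sqrt((APB)_ij / (P alpha)_ij) with alpha = (P^T A P B + (P^T A P B)^T) / 2,
nonnegative symmetric inputs, and a strictly positive starting matrix `P0`. The last point is an
assumption of this sketch: the method described above starts from Umeyama's solution, but a
multiplicative update can never move an entry away from exactly zero. The names `nmf_update`,
`nmf_align`, `round_to_permutation`, `eps`, `max_iter` and `tol` are illustrative and not taken
from Ding et al.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def nmf_update(P, A, B, eps=1e-12):
    """One multiplicative step: P_ij <- P_ij * sqrt((APB)_ij / (P alpha)_ij)."""
    M = P.T @ A @ P @ B
    alpha = (M + M.T) / 2.0          # symmetrized multiplier matrix
    numer = A @ P @ B
    denom = P @ alpha + eps          # eps guards against division by zero
    return P * np.sqrt(numer / denom)


def nmf_align(A, B, P0, max_iter=200, tol=1e-8):
    """Iterate the update until P stops changing or max_iter is reached."""
    P = P0.copy()
    for _ in range(max_iter):
        P_next = nmf_update(P, A, B)
        if np.linalg.norm(P_next - P) < tol:
            return P_next
        P = P_next
    return P


def round_to_permutation(P):
    """Pick the 0/1 assignment with maximum total weight (assignment problem)."""
    rows, cols = linear_sum_assignment(-P)
    P_hard = np.zeros_like(P)
    P_hard[rows, cols] = 1.0
    return P_hard


# Toy run on a random weighted graph and a relabelled copy of it.
rng = np.random.default_rng(0)
n = 5
W = rng.random((n, n))
A = (W + W.T) / 2.0
perm = rng.permutation(n)
B = A[np.ix_(perm, perm)]
P0 = np.full((n, n), 1.0 / n) + 0.01 * rng.random((n, n))

P_soft = nmf_align(A, B, P0)
P_hard = round_to_permutation(P_soft)
```

Each update is dominated by dense n x n matrix products, which is consistent with the cubic cost
noted above. The toy run only demonstrates that the iterate stays nonnegative and that the rounding
step yields a valid permutation; it makes no claim about alignment accuracy.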

Bradde et al. [ ] proposed distributed, heuristic, message-passing algorithms, based
on Belief Propagation [YFW03], for protein alignment and prediction of interacting proteins.
Independently, Bayati et al. [BGSW1; ] formulated graph matching as an integer quadratic
problem, and also proposed message-passing algorithms for aligning sparse networks. In addition
to the input matrices A and B, a sparse and weighted bipartite graph L between the vertices of A
and B is also needed. The edges of graph L represent the possible node matchings between the
two graphs, and their weights capture the similarity between the connected nodes. Singh et al.
[SXBO ] had earlier proposed the use of the full bipartite graph. However, as we have shown in our
experiments, this variation has high memory requirements and does not scale well for large graphs.
A related problem formulation was studied by Klau [Kla09], who proposed a Lagrangian relaxation
approach combined with branch-and-bound to align protein-protein interaction networks and
classify metabolic subnetworks.

In  all  these  works,  the  graphs  that  are  studied  are  unipartite,  while  we  focus  on  bipartite 
graphs,  and  also  propose  an  extension  of  our  method  to  handle  unipartite  graphs. 


^7 We note that this refers to the graph-theoretical problem, which is a special case of the
network flow problem. It does not refer to graph alignment.


8.6 Discussion

The experiments show that BiG-Align efficiently solves a problem that has been neglected in
the literature: the alignment of bipartite graphs. Given that all the efforts have been targeted at
aligning unipartite graphs, why does matching bipartite graphs deserve to be studied separately?
Firstly, bipartite networks are omnipresent: users like webpages, belong to online communities,
access shared files in companies, post in blogs, co-author papers, attend conferences, etc. All these
settings can be modeled as bipartite graphs. Secondly, although it is possible to turn them into
unipartite graphs and apply an off-the-shelf algorithm, as shown in the experiments, knowledge of
the specific structural characteristics can prove useful in achieving alignments of better quality.
Lastly, this problem enables emerging applications. For instance, one may be able to link the
clustering results from different networks by applying soft clustering on the input graphs, and
subsequently our method on the obtained node-cluster membership graphs.

Although  the  main  focus  of  our  work  is  bipartite  graph  alignment,  the  latter  inspires  an 
alternative  way  of  matching  unipartite  graphs,  by  turning  them  into  bipartite.  Therefore,  we 
show  how  our  framework  can  handle  any  type  of  input  graphs,  without  any  restrictions  on  their 
structure.  Moreover,  it  can  be  applied  even  to  align  clouds  of  points;  we  can  extract  features  from 
the  points,  create  a  point-to-feature  graph,  and  apply  BiG-Align  to  the  latter. 

Finally, is our approach simply gradient descent? The answer is no: gradient descent is
the core of our algorithm, but the projection technique, the appropriate initialization and choice
of the gradient step, and the careful handling of known graph properties are the critical design
choices that make our algorithm successful (as shown in Section 8.4, where we compare our method to
simple gradient descent, BiG-Align-Basic).


8.7  Summary 

In  this  chapter  we  have  studied  the  problem  of  graph  matching  for  an  important  class  of  real 
graphs,  bipartite  graphs.  Our  contributions  can  be  summarized  as  follows: 

1.  Problem  Formulation:  We  have  introduced  a  powerful  primitive  with  new  constraints  for 
the  graph  matching  problem. 

2.  Effective and Scalable Algorithm: We have proposed an effective and efficient algorithm,
BiG-Align, based on gradient descent (APGD) to solve our constrained optimization problem
with careful handling of many subtleties. We have also given a generalization of our
approach to align unipartite graphs (Uni-Align).

3.  Experiments on Real Graphs: Our experiments have shown that BiG-Align and Uni-Align
are superior to state-of-the-art graph matching algorithms in terms of both accuracy
and efficiency, for bipartite as well as unipartite graphs.

Future work includes extending our problem formulation to subgraph matching by revisiting
the initialization of the correspondence matrices.


Part III

Conclusions and Future Directions


Chapter 9
Conclusion


Graphs are very powerful representations of data and the relations among them. The web,
friendships and communications, collaborations and phone calls, traffic flow, and brain functions
are only a few examples of the processes that are naturally captured by graphs, which often span
hundreds of millions or billions of nodes and edges. Within this abundance of interconnected data,
a key challenge is the extraction of useful knowledge in a scalable way.

This  thesis  focuses  on  fast  and  principled  methods  for  the  exploratory  analysis  of  a  single  or 
multiple  networks  in  order  to  gain  insights  into  the  underlying  data  and  phenomena.  The  main 
thrusts  of  our  work  in  exploring  and  making  sense  of  one  or  more  graphs  are: 


•  Summarization  -  The  goal  is  to  provide  a  compact  and  interpretable  representation  of  one 
or  more  graphs,  and  nodes’  behaviors.  We  focus  on  summarizing  static  and  time-evolving 
graphs  by  using  easy-to-understand  (static  or  temporal)  motifs. 


•  Similarity  -  The  aim  is  to  discover  clusters  of  nodes  or  graphs  with  related  properties.  At 
the  node  level,  we  contribute  an  analysis  of  guilt-by-association  methods.  At  the  graph  level, 
we  propose  algorithms  for  computing  the  similarity  between  node-aligned  or  not-aligned 
networks,  the  requirements  of  which  are  scalability  and  the  interpretability  of  the  results. 


In both thrusts, our theoretical underpinnings and scalable algorithmic approaches (near-linear
in the number of edges) exploit the sparsity in real-world data, and at the same time handle the
noise and missing values that often occur in it. Our methods are powerful, can be applied in
various settings where the data can be represented as a graph, and have applications ranging from
anomaly detection in static and dynamic graphs (e.g., email communications or computer network
monitoring), to clustering and classification, to re-identification across networks and visualization.


9.1 Single-graph Exploration

To  make  sense  of  a  single  large-scale  graph  and  its  underlying  phenomena,  we  focus  on  two  main 
questions  and  propose  fast  algorithmic  approaches  that  leverage  methods  from  matrix  algebra, 
graph  theory,  information  theory,  machine  learning,  finance,  and  social  science. 

•  “How  can  we  succinctly  describe  a  large-scale  graph?”:  We  contribute  VoG,  a  method 
that  combines  graph  and  information  theory  to  provide  an  interpretable  summary  of  a 
given  unweighted,  undirected  graph  in  a  scalable  way.  It  enables  visualization  and  guides 
attention  to  important  structures  within  graphs  with  billions  of  nodes  and  edges.  Application 
of  VoG  to  a  variety  of  real-world  graphs  (such  as  co-edit  Wikipedia  graphs)  showed  the 
existence  of  structure  in  real-world  networks,  and  the  effectiveness  and  efficiency  of  our 
method. 

•  “What can we learn about all the nodes given prior information for a subset of them?”:
We focused on belief propagation for inference in graphical models under a semi-supervised
multi-class setting and proposed methods that handle heterophily and homophily in attributes.
By starting from the original, iterative belief propagation equations, we derived
a closed formula and gave convergence guarantees (the convergence of the original loopy
belief propagation is not well understood). We also contributed a fast and accurate algorithm
for binary classification, FaBP, based on our derived closed formula, and showed the equivalence
of belief propagation, random walks with restarts and graph-based semi-supervised
learning. We extended FaBP to a multi-class setting, and provided a fast algorithm, LinBP,
which is also based on a closed-form equation, and has exact convergence guarantees. Our
evaluation included node classification on a web snapshot collected by Yahoo with over 6
billion links between webpages in a distributed setting (Yahoo’s M45 Hadoop cluster with
480 machines).


9.2  Multiple-graph  Exploration 

To make sense of multiple large-scale graphs (which are disparate or exhibit temporal
dependencies) and their underlying phenomena, we focus on three main questions and propose scalable
and principled algorithmic approaches that draw methods from matrix algebra, graph theory,
information theory, optimization and machine learning.

•  “How can we succinctly describe a set of large-scale, time-evolving graphs?”: We
extended our Minimum Description Length-based method for exploring and summarizing a
single graph to the setting of multiple temporal graphs and contributed TimeCrunch, which
provides an interpretable way of summarizing unweighted, undirected dynamic graphs
by using a suitable lexicon (e.g., sparkling star, one-shot bipartite core). The power of
the method is in detecting temporally coherent subgraphs which may not appear at every
timestep, and providing interpretable results. It also enables the visualization of dynamic
graphs with hundreds of millions of nodes and interactions between them. The application
of TimeCrunch to real-world graphs (email exchange, instant messaging and phone-call
networks, computer network attacks and co-authorship graphs) shows that time-evolving
networks can be represented compactly by exploiting their temporal structures.

•  “What  is  the  similarity  between  two  graphs?  Which  nodes  and  edges  are  responsible 
for  their  difference?”:  We  contributed  DeltaCon,  which  assesses  the  similarity  between 
aligned  networks  and  guides  attention  to  the  regions  (nodes  or  edges)  that  differ  between 
them.  The  key  idea  behind  DeltaCon  is  to  capture  the  node  influences  by  adapting  FaBP 
(which  was  originally  used  for  node  inference).  The  strength  of  the  method  is  that  it 
combines  global  and  local  graph  structures  to  evaluate  the  similarity  between  two  networks, 
and  it  does  not  extract  a  single  graph  feature  like  other  methods  in  the  literature.  Moreover, 
we  contributed  a  series  of  axioms  and  properties  that  a  graph  similarity  method  should 
satisfy  in  order  to  agree  with  our  intuition,  and  showed  that  DeltaCon  satisfies  them 
analytically and experimentally. Applying our method to real-world data led
to several interesting discoveries, including the significant difference in brain connectivity
between creative and non-creative people.

•  “How  can  we  efficiently  align  two  bipartite  or  unipartite  graphs?”:  We  introduced  the 
formulation  of  alignment  for  bipartite  graphs  with  practical  constraints  (sparsity  of  matching 
and probabilistic alignment). We contributed a gradient descent-based approach with a
series of real-world-network-inspired optimizations, BiG-Align, which is up to 100x faster
and 2-4x more accurate than competitor alignment methods. In addition to that, based
on  our  bipartite  graph  alignment  method,  we  introduced  an  alternative  way  of  aligning 
unipartite  graphs  which  outperforms  the  state-of-the-art  approaches:  our  approach  converts 
a  unipartite  graph  to  bipartite  (through  clustering  or  feature  extraction),  and  leverages 
BiG-Align  to  find  the  node  correspondence  efficiently. 


9.3  Impact 

The  work  has  broad  impact  on  a  variety  of  applications:  anomaly  detection  in  static  and  dynamic 
graphs,  clustering  and  classification,  cross-network  analytics,  re-identification  across  networks, 
and  visualization  in  various  types  of  networks,  including  social  networks  and  brain  graphs.  It  has 
been  used  in  academic  and  industrial  settings: 

•  Taught in graduate classes: Our methods on node inference (FaBP, LinBP), graph similarity
(DeltaCon), and graph summarization (VoG) are being taught in graduate courses,
e.g., Rutgers University (16:198:672), Tepper School at CMU (47-953), Saarland University
(Topics in Algorithmic Data Analysis), and Virginia Tech (CS 6604).

•  Used in the real world:

■  Our summarization work (VoG [KKVF14, KKVF1 ]), similarity algorithm (DeltaCon
[ , ]), and fast anomaly detectors (G-FADD [ ], NetRay [ ])
are used in DARPA’s Anomaly Detection at Multiple Scales project (ADAMS) to detect
insider threats and exfiltration in the government and the military.

■  Seven patents have been filed at IBM on our graph alignment method. One of them
received rate-1 (the top rating corresponding to “extremely high potential business
value for IBM”).

•  Awards: VoG [KKVF1 ] was selected as one of the best papers of SDM’14.


Chapter 10

Vision and Future Work

The  overarching  theme  of  our  research  is  developing  scalable  algorithms  for  understanding 
large  graphs  and  their  underlying  processes.  In  this  work,  we  have  taken  several  steps  toward 
automating  the  sense-making  of  large,  real-world  networks.  Next  we  outline  some  of  the  research 
directions  stemming  from  our  work. 


10.1  Theory: Unifying summarization and visualization of large graphs

The  continuous  generation  of  interconnected  data  creates  the  need  for  summarizing  them  to  extract 
easy-to-understand  information.  Our  work  proposed  a  new,  alternative  way  of  summarizing  large 
undirected and unweighted graphs. A natural next step is the design of principled approaches for
summarizing graphs to address a variety of different needs, such as time-evolving graphs, networks
with side information, directed and/or weighted graphs, summarization of anomalies, and hierarchical
or interactive summarization. We claim that visualization is the other side of the same coin;
summarization  aims  at  minimizing  the  number  of  bits,  and  visualization  the  number  of  pixels. 
Currently,  visualization  of  large  graphs  is  almost  impossible.  It  is  based  on  global  structures  and 
yields  an  uninformative  clutter  of  nodes  and  edges.  Formalizing  graph  visualization  by  unifying 
it  with  information  theory  is  interesting  and  beneficial  for  making  sense  of  large  amounts  of 
networked  data. 

10.2  Systems:  Big  data  systems  for  scalability 

Scalability will continue to be an integral part of real-world applications. Hadoop is appropriate
for several tasks, but falls short when there is a need for real-time analytics, iterative approaches,
and specific structures. Moving forward, it would be beneficial to explore other big data systems
(e.g., Spark, GraphLab, Cloudera Impala) that match the intrinsic nature of the data at hand and
the special requirements of the methods. Moreover, investigating how to exploit the capabilities of
big data systems to provide the analyst with fast, approximate answers, which will then be refined
further depending on the time constraints, can contribute to even more scalable approaches to sift
through and understand the ever-growing amounts of interconnected data.


10.3  Application:  Understanding  brain  networks 

This work includes analysis of brain graphs and several fascinating scientific discoveries, which
is just an initial step in the realm of analyzing brain networks to understand how the brain
works. Many questions about the function of the brain and mental diseases remain unanswered,
despite the huge economic cost of neurological disorders. Scalable and accurate graph-based
computational methods can contribute to brain research, which is supported by the US
government through the Brain Research through Advancing Innovative Neurotechnologies (BRAIN)
Initiative, and shed light on questions such as: How do an individual’s brain connections change
over time when she is developing Alzheimer’s disease (AD)? How does the brain connectivity differ
between autistic and healthy subjects? Is it possible to detect neurological diseases and syndromes
from the changes that occur in a brain graph?

In conclusion, understanding, mining and managing large graphs has numerous high-impact
applications and poses fascinating research challenges.